Probabilistic Anti-Spam Filtering with Dimensionality Reduction

ABSTRACT

One of the biggest problems of e-mail communication is the massive delivery of spam messages. Every day, billions of unwanted messages are sent by spammers, and this number does not stop growing. Fortunately, there are different approaches able to automatically detect and remove most of these messages, and the best-known ones are based on Bayesian decision theory. However, many machine learning techniques applied to text categorization share the same difficulty: the high dimensionality of the feature space. Many term selection methods have been proposed in the literature. Nevertheless, it is still unclear how the performance of naive Bayes anti-spam filters depends on the methods applied for reducing the dimensionality of the feature space. In this paper, we compare the performance of the most popular term selection techniques combined with several variations of the original naive Bayes anti-spam filter.

Categories and Subject Descriptors: I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis

General Terms: Anti-spam Filtering

Keywords: Term selection techniques, naive Bayes anti-spam filter, dimensionality reduction

1. INTRODUCTION

The term spam is generally used to denote an unsolicited commercial e-mail. Spam messages are annoying to most users because they clutter their mailboxes. This problem can be quantified in economic terms, since many hours are wasted every day by workers: not just the time they waste reading the spam, but also the time they spend deleting those messages.

According to a report published by McAfee in March¹, the cost in lost productivity per user per day is approximately $0.50, based on users having to spend 30 seconds dealing with two spam messages each day and the user's spam filter working at 95 percent accuracy (the average accuracy achieved by the majority of available anti-spam filters). Therefore, the productivity loss per employee per year due to spam is approximately $182.50. A company with 1,000 workers earning $30 per hour would thus suffer $182,500 per year in lost productivity, which works out to more than $41,000 per 1 percent of spam allowed into the company.

Many methods have been proposed to automatically classify e-mail messages [3]. Among all proposed techniques, machine learning algorithms have achieved the most success [1, 3]. These methods include approaches considered top performers in text categorization, such as Boosting, Support Vector Machines (SVM), and naive Bayes classifiers [1, 5]. The latter currently appear to be very popular in commercial and open-source spam filters. This is probably due to their simplicity, which makes them easy to implement, their linear computational complexity, and their accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms [5].
A major difficulty in dealing with text categorization problems using approaches based on Bayesian probability is the high dimensionality of the feature space [1]. The native feature space consists of the unique terms (characters or words) that occur in messages, which can amount to tens or hundreds of thousands of terms even for a moderate-sized e-mail collection. This is prohibitively high for most learning algorithms (exceptions are k-nearest neighbors and SVM). Hence, it is highly desirable to reduce the native space without sacrificing categorization accuracy, and it is also desirable to achieve such a goal automatically [4, 8].

In this paper, we present a comparative study of the five most used term selection techniques combined with four variants of the original naive Bayes algorithm for anti-spam filtering, in order to examine how the term selection techniques affect the categorization accuracy of different anti-spam filters based on Bayesian decision theory.

The remainder of this paper is organized as follows: Section 2 presents details of the term selection techniques. The naive Bayes anti-spam filters are described in Section 3. Section 4 presents the performance measurements used for comparing the achieved results. Section 5 describes the methodology we employ in our experiments. Experimental results are shown in Section 6. Finally, Section 7 offers conclusions and future work.

¹ See reports/mar_spam_report.pdf

2. DIMENSIONALITY REDUCTION

In text categorization, the high dimensionality of the term space ($|T|$) may be problematic. In fact, many algorithms used for classifier induction cannot scale to high values of $|T|$. As a consequence, a pass of dimensionality reduction is often applied beforehand, whose effect is to reduce the size of the vector space from $|T|$ to $|T'| \ll |T|$; the set $T'$ is called the reduced term set [7].

Techniques for term selection attempt to select, from the original set $T$, the subset $T'$ of terms (with $|T'| \ll |T|$) that, when used for document indexing, yields the highest effectiveness. Selecting the best terms requires a function that ranks terms according to how good they are; a computationally easy alternative is to keep the $|T'|$ terms that receive the highest score according to a function that measures the importance of each term for the text categorization task.

2.1 Representation

Assuming that each message $m$ is composed of a set of terms $m = \{t_1, \ldots, t_n\}$, where each term $t_k$ corresponds to a word ("adult", for example), a set of words ("to be removed"), or a single character ("$"), we can represent each message by a vector $\vec{x} = (x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are the values of the attributes $X_1, \ldots, X_n$ associated with the terms $t_1, \ldots, t_n$. In the simplest case, each term represents a single word and all attributes are Boolean: $X_i = 1$ if the message contains $t_i$, or $X_i = 0$ otherwise. Alternatively, attributes may be integer values computed from term frequencies (TF), representing how many times each term occurs in the message. A third alternative is to associate each attribute $X_i$ with a normalized TF, $x_i = t_i(m) / |m|$, where $t_i(m)$ is the number of occurrences in $m$ of the term represented by $X_i$, and $|m|$ is the length of $m$ measured in term occurrences. Normalized TF takes into account term repetition relative to the size of the message [1].
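To make the three representations concrete, the following minimal sketch (our illustration, not part of the original paper; it assumes messages are already tokenized into lists of terms) builds the attribute vector $\vec{x}$ under each scheme:

```python
from collections import Counter

def vectorize(tokens, vocabulary, mode="boolean"):
    """Build the attribute values x_1..x_n of Section 2.1 for the
    terms in `vocabulary`, from an already-tokenized message."""
    counts = Counter(tokens)
    length = max(len(tokens), 1)       # |m|, message length in term occurrences
    vector = []
    for term in vocabulary:
        tf = counts[term]              # t_i(m): occurrences of the term in m
        if mode == "boolean":          # X_i = 1 iff the message contains t_i
            vector.append(1 if tf > 0 else 0)
        elif mode == "tf":             # raw term frequency
            vector.append(tf)
        else:                          # normalized TF: t_i(m) / |m|
            vector.append(tf / length)
    return vector

tokens = "buy cheap meds buy now".split()
vocab = ["buy", "cheap", "viagra"]
print(vectorize(tokens, vocab, "boolean"))  # [1, 1, 0]
print(vectorize(tokens, vocab, "tf"))       # [2, 1, 0]
print(vectorize(tokens, vocab, "norm"))     # [0.4, 0.2, 0.0]
```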
2.2 Term selection techniques

In the following, we describe the five most used Term Space Reduction (TSR) techniques. Probabilities are interpreted on an event space of documents (for example, $P(\bar{t}_k, c_i)$ denotes the probability that, for a random message $m$, term $t_k$ does not occur in $m$ and $m$ belongs to category $c_i$) and are estimated by counting occurrences in the training set $Tr$. Since there are only two categories, spam ($c_s$) and legitimate ($c_l$), some functions are specified locally to a specific category. In order to assess the value of a term $t_k$ in a global, category-independent sense, either the sum $f_{sum}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i)$, the weighted sum $f_{wsum}(t_k) = \sum_{i=1}^{|C|} P(c_i) f(t_k, c_i)$, or the maximum $f_{max}(t_k) = \max_{i=1}^{|C|} f(t_k, c_i)$ of the category-specific values $f(t_k, c_i)$ is computed.

2.2.1 Document frequency (DF)

DF is given by the frequency of messages containing the term $t_k$ in the training set $Tr$:

$$DF(t_k) = \frac{|Tr_{t_k}|}{|Tr|}$$

where $|Tr_{t_k}|$ is the number of messages in the training set that contain the term $t_k$ and $|Tr|$ is the number of available messages [8].

2.2.2 Information gain (IG)

IG is frequently employed as a term-goodness criterion in the field of machine learning [6]. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a message [4]. The IG of a term $t_k$ is computed as

$$IG(t_k) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t) P(c)}$$

2.2.3 Mutual information (MI)

MI is a criterion commonly used in statistical language modeling of word associations and related applications [8]. The mutual information between $t_k$ and $c_i$ is defined as

$$MI(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k) P(c_i)}$$

$MI(t_k, c_i)$ has a natural value of zero if $t_k$ and $c_i$ are independent. To measure the goodness of a term for global feature selection, we can combine the category-specific scores in the three alternative ways: $f_{sum}$, $f_{wsum}$, or $f_{max}$.

2.2.4 $\chi^2$ statistic

The $\chi^2$ statistic measures the lack of independence between the term $t_k$ and the class $c_i$. It can be compared to the $\chi^2$ distribution with one degree of freedom to judge extremeness, and it has a natural value of zero if $t_k$ and $c_i$ are independent. We can calculate the $\chi^2$ statistic for the term $t_k$ in the class $c_i$ by

$$\chi^2(t_k, c_i) = \frac{|Tr| \left[ P(t_k, c_i) P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) P(\bar{t}_k, c_i) \right]^2}{P(t_k) P(\bar{t}_k) P(c_i) P(\bar{c}_i)}$$

2.2.5 Odds ratio (OR)

OR measures the ratio between the odds of the term appearing in a relevant document and the odds of it appearing in a non-relevant one. The odds ratio between $t_k$ and $c_i$ is given by

$$OR(t_k, c_i) = \frac{P(t_k \mid c_i) \left(1 - P(t_k \mid \bar{c}_i)\right)}{\left(1 - P(t_k \mid c_i)\right) P(t_k \mid \bar{c}_i)}$$

An OR of 1 indicates that term $t_k$ is equally likely in both classes $c_i$ and $\bar{c}_i$; an OR greater than 1 indicates that $t_k$ is more likely in $c_i$; an OR less than 1 indicates that $t_k$ is less likely in $c_i$. As with MI, to measure the goodness of a term for global feature selection, we can combine the category-specific scores using the functions $f_{sum}$, $f_{wsum}$, or $f_{max}$.
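All five TSR functions reduce to counting document-level co-occurrences on the training set. The sketch below (our illustration; the `eps` guard against empty probability cells and all function names are our additions) estimates the probabilities from Boolean term presence and returns the five scores for one term:

```python
import math

def tsr_scores(messages, labels, term, klass="spam"):
    """Estimate document-level probabilities for one term from a training
    set (messages given as sets of terms) and return the five TSR scores
    of Section 2.2 with respect to the class `klass`."""
    n = len(messages)
    n_t = sum(1 for m in messages if term in m)
    n_c = sum(1 for y in labels if y == klass)
    n_tc = sum(1 for m, y in zip(messages, labels) if term in m and y == klass)

    p_t, p_c = n_t / n, n_c / n
    p_tc = n_tc / n                        # P(t_k, c_i)
    p_tnc = (n_t - n_tc) / n               # P(t_k, ~c_i)
    p_ntc = (n_c - n_tc) / n               # P(~t_k, c_i)
    p_ntnc = 1.0 - p_tc - p_tnc - p_ntc    # P(~t_k, ~c_i)
    eps = 1e-12                            # guard against empty cells (our addition)

    df = p_t
    ig = sum(p * math.log(p / (pt * pc))
             for p, pt, pc in [(p_tc, p_t, p_c), (p_tnc, p_t, 1 - p_c),
                               (p_ntc, 1 - p_t, p_c), (p_ntnc, 1 - p_t, 1 - p_c)]
             if p > 0)                     # joint <= marginals, so log is safe
    mi = math.log((p_tc + eps) / (p_t * p_c + eps))
    chi2 = n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / (
        p_t * (1 - p_t) * p_c * (1 - p_c) + eps)
    p_t_giv_c = p_tc / (p_c + eps)         # P(t_k | c_i)
    p_t_giv_nc = p_tnc / (1 - p_c + eps)   # P(t_k | ~c_i)
    odds = (p_t_giv_c * (1 - p_t_giv_nc) + eps) / (
        (1 - p_t_giv_c) * p_t_giv_nc + eps)
    return {"DF": df, "IG": ig, "MI": mi, "CHI2": chi2, "OR": odds}

def or_wsum(messages, labels, term, priors):
    """f_wsum applied to OR: sum_i P(c_i) * OR(t_k, c_i)."""
    return sum(p * tsr_scores(messages, labels, term, c)["OR"]
               for c, p in priors.items())

def select_terms(vocab, score_fn, k):
    """Keep the |T'| = k highest-scoring terms."""
    return sorted(vocab, key=score_fn, reverse=True)[:k]
```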

3. NAIVE BAYES ANTI-SPAM FILTERS

Naive Bayes anti-spam filtering (NB) has become the most popular mechanism to distinguish spam messages from legitimate ones [5]. From Bayes' theorem and the theorem of total probability, the probability that a message with vector $\vec{x} = (x_1, \ldots, x_n)$ belongs to a category $c_i \in \{c_s, c_l\}$ is

$$P(c_i \mid \vec{x}) = \frac{P(c_i) P(\vec{x} \mid c_i)}{P(\vec{x})}$$

Since the denominator does not depend on the category, the filter classifies each message into the category that maximizes $P(c_i) P(\vec{x} \mid c_i)$. In spam filtering, this is equivalent to classifying a message as spam ($c_s$) whenever

$$\frac{P(c_s) P(\vec{x} \mid c_s)}{P(c_s) P(\vec{x} \mid c_s) + P(c_l) P(\vec{x} \mid c_l)} > T$$

with $T = 0.5$. By varying $T$, we can opt for more true negatives at the expense of fewer true positives, or vice versa. The a priori probabilities $P(c_i)$ can be estimated as the frequency of occurrence of documents belonging to category $c_i$ in the training set $Tr$. On the other hand, $P(\vec{x} \mid c_i)$ is practically impossible to estimate directly, because we would need messages in $Tr$ identical to the one we want to classify. However, the NB classifier makes the simplifying assumption that the terms in a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities $P(\vec{x} \mid c_i)$ are estimated differently in each NB version [5]. Several studies have found the NB classifier to be surprisingly effective in the spam filtering task [1, 5], despite the fact that its independence assumption is usually oversimplistic. In the following, we describe four well-known versions of the NB anti-spam filter available in the literature.

3.1 Multinomial term frequency naive Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a set of terms $m = \{t_1, \ldots, t_n\}$, counting for each $t_k$ how many times it appears in $m$. In this sense, $m$ can be represented by a vector $\vec{x} = (x_1, \ldots, x_n)$, where each $x_k$ corresponds to the number of occurrences of $t_k$ in $m$. Moreover, each message $m$ of category $c_i$ can be interpreted as the result of independently picking $|m|$ terms from $T'$ with replacement, with probability $P(t_k \mid c_i)$ for each $t_k$. Hence, $P(\vec{x} \mid c_i)$ is the multinomial distribution:

$$P(\vec{x} \mid c_i) = P(|m|) \cdot |m|! \cdot \prod_{k=1}^{n} \frac{P(t_k \mid c_i)^{x_k}}{x_k!}$$

The criterion for classifying a message as spam becomes

$$\frac{P(c_s) \prod_{k=1}^{n} P(t_k \mid c_s)^{x_k}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k \mid c_i)^{x_k}} > T$$

and the probabilities $P(t_k \mid c_i)$ are estimated with a Laplacian prior:

$$P(t_k \mid c_i) = \frac{1 + N_{t_k, c_i}}{n + N_{c_i}}$$

where $N_{t_k, c_i}$ is the number of occurrences of term $t_k$ in the training messages of category $c_i$, and $N_{c_i} = \sum_{k=1}^{n} N_{t_k, c_i}$.

3.2 Multinomial Boolean naive Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to MN TF NB, including the estimates of $P(t_k \mid c_i)$, except that each attribute $x_k$ is Boolean.

3.3 Multivariate Bernoulli naive Bayes

Let $T' = \{t_1, \ldots, t_n\}$ be the set of terms that remain after term selection. The multivariate Bernoulli NB (MV Bernoulli NB) represents each message $m$ by registering only the presence or absence of each term. Therefore, $m$ can be represented as a binary vector $\vec{x} = (x_1, \ldots, x_n)$, where each $x_k$ indicates whether or not $t_k$ occurs in $m$. The probabilities $P(\vec{x} \mid c_i)$ are computed by [5]:

$$P(\vec{x} \mid c_i) = \prod_{k=1}^{n} P(t_k \mid c_i)^{x_k} \left(1 - P(t_k \mid c_i)\right)^{(1 - x_k)}$$

The criterion for classifying a message as spam becomes

$$\frac{P(c_s) \prod_{k=1}^{n} P(t_k \mid c_s)^{x_k} (1 - P(t_k \mid c_s))^{(1 - x_k)}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k \mid c_i)^{x_k} (1 - P(t_k \mid c_i))^{(1 - x_k)}} > T$$

where the probabilities $P(t_k \mid c_i)$ are estimated with a Laplacian prior:

$$P(t_k \mid c_i) = \frac{1 + |Tr_{t_k, c_i}|}{2 + |Tr_{c_i}|}$$

where $|Tr_{t_k, c_i}|$ is the number of training messages of category $c_i$ that contain the term $t_k$, and $|Tr_{c_i}|$ is the total number of training messages of category $c_i$.

3.4 Flexible Bayes

Flexible Bayes (FB) represents the probabilities $P(x_k \mid c_i)$ as the average of $L_{k,c_i}$ normal distributions with different values for $\mu_{k,c_i}$ but the same $\sigma_{c_i}$:

$$P(x_k \mid c_i) = \frac{1}{L_{k,c_i}} \sum_{l=1}^{L_{k,c_i}} g(x_k; \mu_{k,c_i,l}, \sigma_{c_i})$$

where $L_{k,c_i}$ is the number of different values the attribute $X_k$ takes in the training set $Tr$ of category $c_i$. Each of these values is used as the $\mu_{k,c_i,l}$ of a normal distribution of category $c_i$. However, all distributions of a category $c_i$ are taken to have the same $\sigma_{c_i} = \frac{1}{\sqrt{|Tr_{c_i}|}}$ [1].
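As an illustration of how these criteria fit together, here is a minimal sketch of the MV Bernoulli NB variant with the Laplacian prior of Section 3.3 and the decision threshold $T$. The class name and API are our own choices, not the authors'; only the estimation formulas come from the paper:

```python
import math

class BernoulliNB:
    """Minimal MV Bernoulli NB filter (Section 3.3) over a reduced term
    set T'. Messages are sets of terms; labels are "spam"/"legitimate"."""

    def __init__(self, vocab):
        self.vocab = list(vocab)

    def fit(self, messages, labels):
        self.classes = sorted(set(labels))
        self.prior, self.p_term = {}, {}
        for c in self.classes:
            docs = [m for m, y in zip(messages, labels) if y == c]
            self.prior[c] = len(docs) / len(messages)   # P(c_i)
            # Laplacian prior: P(t_k | c_i) = (1 + |Tr_{t_k,c_i}|) / (2 + |Tr_{c_i}|)
            self.p_term[c] = {
                t: (1 + sum(1 for m in docs if t in m)) / (2 + len(docs))
                for t in self.vocab}
        return self

    def spam_posterior(self, message):
        """P(c_s | x) under the conditional-independence assumption."""
        log_joint = {}
        for c in self.classes:
            s = math.log(self.prior[c])
            for t in self.vocab:
                p = self.p_term[c][t]
                # a present term contributes log p, an absent one log(1 - p)
                s += math.log(p) if t in message else math.log(1.0 - p)
            log_joint[c] = s
        z = max(log_joint.values())          # log-sum-exp for numerical stability
        total = sum(math.exp(v - z) for v in log_joint.values())
        return math.exp(log_joint["spam"] - z) / total

    def classify(self, message, T=0.5):
        return "spam" if self.spam_posterior(message) > T else "legitimate"
```

Note that the loop over `self.vocab` is what lets absent terms lower the score, the property that distinguishes MV Bernoulli NB from the multinomial variants.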
4. PERFORMANCE MEASUREMENTS

Let $S$ and $L$ be the sets of spam and legitimate messages, respectively. The possible prediction results are: true positives ($TP$), the set of spam messages correctly classified; true negatives ($TN$), the set of legitimate messages correctly classified; false negatives ($FN$), the set of spam messages incorrectly classified as legitimate; and false positives ($FP$), the set of legitimate messages incorrectly classified as spam.

Some well-known evaluation measurements are: true positive rate ($TPr$), true negative rate ($TNr$), spam precision ($Spr$), legitimate precision ($Lpr$), ROC curves [3], precision and recall [5], accuracy rate ($Acc$), and Total Cost Ratio ($TCR$) [1]. $TCR$ was first proposed by [1] and offers an indication of the improvement provided by the filter: a greater $TCR$ indicates better performance, and for $TCR < 1$, not using the filter is better. However, we opt to use the Matthews Correlation Coefficient [2]:

$$MCC = \frac{|TP| \cdot |TN| - |FP| \cdot |FN|}{\sqrt{(|TP| + |FP|)(|TP| + |FN|)(|TN| + |FP|)(|TN| + |FN|)}}$$

since it provides more information than $TCR$. It returns a real value between $-1$ and $+1$: a coefficient equal to $+1$ indicates a perfect prediction; $0$, an average random prediction; and $-1$, an inverse prediction. According to [2], MCC is one of the best measures, and it often provides a more balanced evaluation of the prediction than other measures such as the proportion of correct predictions (accuracy), especially if the two classes are of very different sizes.
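Computed from the confusion counts, MCC is a one-liner. The sketch below uses hypothetical counts on a skewed corpus; the convention of returning 0 for a zero denominator is our addition:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient over the confusion counts
    |TP|, |TN|, |FP|, |FN| of Section 4."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(100, 900, 0, 0))  # 1.0: a perfect split reaches +1
print(mcc(99, 900, 0, 1))   # ~0.994: one false negative barely moves it
```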

5. METHODOLOGY

We use the six well-known Enron corpora (E1-E6) [5] in our experiments. In order to provide an aggressive dimensionality reduction, we sort the messages according to their arrival date and perform the training stage using the first 90% of the received messages; the remaining ones are set aside for classification. After collecting all terms of the training messages, we remove the irrelevant ones: we consider as irrelevant all terms which appear in fewer than five training messages (less than 0.1% of each corpus). Once the training stage has finished, we apply a term selection technique to reduce the dimensionality of the term space. In order to provide a complete evaluation, we vary the number of terms from 10% to 100% of all terms retained in the preprocessing stage. Next, we classify the testing messages using the naive Bayes anti-spam filters presented in Section 3. We evaluate all possible combinations of NB anti-spam filters and term selection methods.

The first studies of naive Bayes anti-spam filters employed an unbalanced classification using cost-sensitive learning, varying the value of the threshold $T$ depending on the usage scenario. However, recent works have shown that the choice of $T$ is very difficult because it can vary for each user. The same works indicate that setting $T = 0.5$ seems reasonable and generally offers a good average prediction [5].
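A sketch of this protocol follows (the helper names and callback structure are ours; the paper does not publish code): a chronological 90/10 split, pruning of terms seen in fewer than five training messages, then a sweep over 10% to 100% of the retained vocabulary for one filter/TST combination:

```python
from collections import Counter

def chronological_split(messages, dates, labels, train_frac=0.9):
    """Sort by arrival date and train on the first 90% (Section 5)."""
    order = sorted(range(len(messages)), key=lambda i: dates[i])
    cut = int(train_frac * len(order))
    tr, te = order[:cut], order[cut:]
    return ([messages[i] for i in tr], [labels[i] for i in tr],
            [messages[i] for i in te], [labels[i] for i in te])

def retained_vocab(train_msgs, min_docs=5):
    """Preprocessing: drop terms appearing in fewer than five
    training messages (less than 0.1% of each Enron corpus)."""
    df = Counter(t for m in train_msgs for t in set(m))
    return [t for t, n in df.items() if n >= min_docs]

def sweep(vocab, rank_key, make_filter, train, test, evaluate):
    """Vary |T'| from 10% to 100% of the retained terms; `rank_key`
    scores a term under one TST, `make_filter` builds one NB variant,
    and `evaluate` returns e.g. the MCC on the held-out messages."""
    ranked = sorted(vocab, key=rank_key, reverse=True)
    results = {}
    for frac in [i / 10 for i in range(1, 11)]:
        selected = ranked[: int(frac * len(ranked))]
        clf = make_filter(selected).fit(*train)
        results[frac] = evaluate(clf, *test)
    return results
```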

6. EXPERIMENTAL RESULTS

In the following, we present the results achieved for each corpus. In the remainder of this section, consider the following abbreviations: MN Boolean NB as MN Bool, MN term frequency NB as MN TF, MV Bernoulli NB as Bern, and flexible Bayes as FB. Due to space limitations, we present the best results and the best average prediction achieved by each NB classifier using the different term selection techniques (TSTs). We define the best result as the maximum MCC attained by the combination of one NB classifier and one TST using a specific number of terms.²

Tables 1, 3, 5, 7, 9, and 11 present the best performance achieved by each classifier for each Enron dataset. Tables 2, 4, 6, 8, 10, and 12 detail the sets of results attained by the two classifiers which accomplished the best performance, comparing them on $|T'|$ (% of $|T|$), $TPr$(%), $Spr$(%), $TNr$(%), $Lpr$(%), $Acc$(%), $TCR$, and $MCC$.

Table 1: E1, best result achieved by each filter: Bern with $OR_{wsum}$, MN Bool with $OR_{wsum}$, MN TF with $OR_{wsum}$, and FB with IG.
Table 2: E1, two best individual performances: Bern with $OR_{wsum}$ and MN Bool with $OR_{wsum}$.
Table 3: E2, best result achieved by each filter: Bern with $OR_{sum}$, MN TF with $OR_{sum}$, MN Bool with $OR_{sum}$, and FB with IG.
Table 4: E2, two best individual performances: Bern with $OR_{sum}$ and MN TF with $OR_{sum}$.
Table 5: E3, best result achieved by each filter: Bern with IG, MN Bool with IG, MN TF with IG, and FB with $MI_{max}$.
Table 6: E3, two best individual performances: Bern with IG and MN Bool with IG.
Table 7: E4, best result achieved by each filter: Bern with IG, MN Bool with $OR_{wsum}$, MN TF with $OR_{wsum}$, and FB with $OR_{wsum}$.
Table 8: E4, two best individual performances: Bern with IG and MN Bool with $OR_{wsum}$.
Table 9: E5, best result achieved by each filter: Bern with $OR_{sum}$, MN Bool with $OR_{wsum}$, MN TF with $OR_{sum}$, and FB with $\chi^2$.
Table 10: E5, two best individual performances: Bern with $OR_{sum}$ and MN Bool with $OR_{wsum}$.
Table 11: E6, best result achieved by each filter: Bern with $OR_{sum}$, MN Bool with $OR_{sum}$, FB with IG, and MN TF with $OR_{sum}$.
Table 12: E6, two best individual performances: Bern with $OR_{sum}$ and MN Bool with $OR_{sum}$.

Note that in Table 2 the combination of MV Bernoulli NB with $OR_{wsum}$ attained the same $TCR$ as MN Boolean NB with $OR_{wsum}$; however, their MCCs are different. This happens because the MCC offers a more balanced evaluation of the prediction, particularly if the classes are of different sizes, as discussed in Section 4. The E1 dataset has more legitimate messages than spam, hence false negatives affect the classifiers' performance more than false positives.

Table 8 shows another drawback of $TCR$. MV Bernoulli NB with IG using 20% of $|T|$ achieved a perfect prediction ($MCC = 1.000$ and $TCR = +\infty$) on Enron 4. On the other hand, MN Boolean NB with $OR_{wsum}$ using 40% of $|T|$ classified only one spam message incorrectly as legitimate ($|FP| = 0$, $|FN| = 1$) and accomplished $MCC = 0.996$ and $TCR = 450$. If we analyzed only the $TCR$, we would wrongly conclude that the first combination was much better than the second one.

The results indicate that the classifiers generally worsen their performance when the complete set of terms $T$ is used for classifying. However, we found a good trade-off between 40% and 60% of $|T|$, which usually achieves the best performance. Even a set of selected terms composed of only 10% to 30% of $|T|$ offers better results than a set with all terms of $T$.

Collectively, regarding the term selection techniques, the reported experiments indicate that $\{OR, IG\} > \{\chi^2, DF\} \gg MI$, where $>$ means "performs better than". However, we found that IG and $\chi^2$ are less sensitive to the variation of $|T'|$. On the other hand, MI presented the worst individual and average performance. We also verify that the performance of the NB classifiers is highly sensitive to the quality of the terms selected by the TSTs and to the number of selected terms $|T'|$.

With respect to the classifiers, the results indicate that $\{$MV Bernoulli NB$\} > \{$MN Boolean NB, MN term frequency NB$\} > \{$flexible Bayes$\}$. MV Bernoulli NB attained the best individual performance for the majority of the datasets used in our experiments. It should be emphasized that MV Bernoulli NB is the only approach which takes into account the absence of terms in the messages. Our experiments also show that filters which use Boolean attributes achieve better results than those using term frequencies.

² A comprehensive set of results, including all tables and figures, is available at

7. CONCLUSIONS AND FURTHER WORK

In this paper, we presented an evaluation of feature selection methods for dimensionality reduction in the anti-spam filtering domain. We compared the performance achieved by four NB anti-spam filters with different kinds of representation, applied to classify messages as legitimate or spam on six public, large, and real datasets, after a pass of dimensionality reduction carried out by five term selection techniques with a varying number of selected terms.

Regarding the TSTs, we found OR and IG to be the most effective for aggressive term removal without losing categorization accuracy. On the other hand, the employment of MI generally offers poor results which frequently worsen the classifier's performance. The results also indicate that IG and $\chi^2$ are less sensitive to the variation of $|T'|$. Among all classifiers, MV Bernoulli NB achieved the best performance, and the results verify that Boolean attributes perform better than term frequency ones. We also confirm that the performance of NB classifiers is highly sensitive to the selected attributes and to the number of terms selected by the TSTs in the training stage: the better the term selection technique, the better the classifier's prediction.

Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, spammers try to evolve their messages in order to outwit the classifiers. An efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. In this way, collaborative filters can be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise in spam messages in order to hamper probability estimation. Thus, filters should have a flexible way to compare terms in the classification task; approaches based on fuzzy logic can be employed to make the comparison more flexible.

8. REFERENCES
[1] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial e-mail. Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, Greece, March 2004.
[2] P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5), May 2000.
[3] G. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4).
[4] G. Forman and E. Kirshenbaum. Extremely fast text feature extraction for classification and indexing. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, November 2008.
[5] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Proceedings of the 3rd International Conference on Email and Anti-Spam, pages 1-5, Mountain View, CA, USA, July 2006.
[6] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[7] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, March 2002.
[8] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, July 1997.
