Project Report. Prepared for: Dr. Liwen Shih Prepared by: Joseph Hayes. April 17, 2008 Course Number: CSCI
University of Houston Clear Lake
School of Science & Computer Engineering
Project Report
Prepared for: Dr. Liwen Shih
Prepared by: Joseph Hayes
April 17, 2008
Course Number: CSCI
University of Houston Clear Lake - School of Science & Computer Engineering, 2700 Bay Area Blvd, Houston, Texas T
Spam Filtering - An Alternative Approach

Table of Contents
I. Abstract - Brief overview of the spam filtering problem statement, approach taken, and results / findings
II. Introduction - Background and history of spam filtering, as well as a brief highlight of proposed techniques
III. Objectives - Goals of the project, in regard to techniques and desired results in quantitative terms
IV. Methods and Procedures - In-depth explanation of the methods involved in the spam filtering of the project
V. Results and Discussion - Results of the above techniques and discussion of the outcome
VI. References - Sources referencing the material described in the introduction and other portions of the project
VII. Appendices - Project Poster, Data, etc.

Abstract
Due to the ever-increasing amount of unsolicited email, commonly referred to as spam, techniques have arisen for combating such messages. Although prevalent, Bayesian filters often misclassify legitimate email. We provide a supervised neural network approach to the filtering problem. This technique shows promising results, with zero misclassification on a corpus of moderate size.

Introduction
Effective spam filtering has long been a problem. The difficulty lies in the necessity of very small false rejection rates, as the misclassification of a valid email is rarely tolerable. Because senders of unsolicited email ("spammers") very often disguise their messages to appear as valid correspondence, perfect filtering is impossible. Borrowing terminology from the field of biometrics, it would be possible to calculate the equal error rate of a given test set. From this value, we could produce a system capable of balancing the false rejection rate with the false acceptance rate [1].
However, this approach will typically yield unacceptable results, as false rejections come at a much higher cost to the user: the cost of missing an important email usually outweighs that of improperly accepting a particular spam message. Therefore, a virtually non-existent false rejection rate is necessary, so we instead focus on reducing the false acceptance rate as much as possible while maintaining a minimal false rejection rate.
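The trade-off between the two error rates can be illustrated with a short sketch. It assumes a hypothetical filter that assigns each message a numeric score (higher meaning more spam-like) and rejects messages at or above a threshold; the function names and the scoring convention are illustrative assumptions, not part of the project.

```python
def error_rates(scores, labels, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at one threshold.

    scores: hypothetical filter outputs, higher means more spam-like.
    labels: True for spam, False for ham.
    A message is rejected when its score is >= threshold.
    """
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    frr = sum(s >= threshold for s in ham) / len(ham)    # ham wrongly rejected
    far = sum(s < threshold for s in spam) / len(spam)   # spam wrongly accepted
    return frr, far


def equal_error_threshold(scores, labels):
    """Candidate threshold at which FRR and FAR are closest to each other."""
    def gap(t):
        frr, far = error_rates(scores, labels, t)
        return abs(frr - far)
    return min(sorted(set(scores)), key=gap)


def strict_threshold(scores, labels, max_frr=0.0):
    """Lowest candidate threshold whose FRR stays within max_frr.

    Among thresholds that satisfy the FRR limit, the lowest one accepts
    the least spam. Returns None if no candidate satisfies the limit.
    """
    for t in sorted(set(scores)):
        if error_rates(scores, labels, t)[0] <= max_frr:
            return t
    return None
```

Operating at the equal-error threshold treats both mistakes as equally costly, whereas `strict_threshold` reflects the preference stated above: hold the false rejection rate near zero and accept whatever false acceptance rate results.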
Traditional spam filters use Bayesian techniques; that is, statistical methods based on Bayes' Theorem. As described by Obied, the probability that an email message belongs to a particular class can be calculated using Bayes' Theorem as follows:

$$P(C \mid f_1, \ldots, f_n) = \frac{P(C)\, P(f_1, \ldots, f_n \mid C)}{P(f_1, \ldots, f_n)}$$

where each class C represents either the class of spam messages or the class of non-spam messages, and each f_i represents a particular feature used in classifying messages [2]. Most Bayesian spam filters assume no a priori knowledge; initial probabilities are unknown. Through training, these filters can learn to differentiate effectively between spam and non-spam messages.

An alternative to Bayesian techniques, as suggested by this paper, involves machine learning through the use of a supervised neural network. Other machine learning techniques have recently been attempted, including variants of k-Nearest-Neighbor (k-NN) classifiers [3]. An advantage of these techniques is that they commonly make use of an adjustable threshold, allowing the sensitivity of the filter to be modified at will. The approach that we present here does not provide such options; rather, we seek to show that our method filters with enough precision to eliminate the need for such adjustments.

For practically any spam filtering scheme, we must build and maintain databases containing the frequency with which specific words typically appear in both spam and ham (legitimate) messages. One problem arises when determining how to efficiently store and search for these words within our databases. A common practice, frequently used in Bayesian filtering, is to maintain two separate hash tables, one for each mail classification (spam and ham). Each entry in the hash tables consists of a key-value pair: a specific word paired with the number of occurrences (for a specific classification) found during the training phase.
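The two-table layout described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl script used in the project; the tokenizer, the function names, and the toy messages are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a message body into lowercase word tokens."""
    return TOKEN.findall(text.lower())


def build_table(messages):
    """Map each word to its number of occurrences across the given messages."""
    table = Counter()
    for message in messages:
        table.update(tokenize(message))
    return table


# One table per classification, built from toy data.
spam_table = build_table(["Buy cheap pills now", "Cheap, cheap offer"])
ham_table = build_table(["Meeting moved to noon", "See you at the meeting"])
```

A `Counter` is a dictionary subclass, so lookups are hash-table lookups, and a word that never appeared in a class simply reports a count of zero.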
Here, the term "training phase" refers to the training phase of Bayesian filters, which may continue even after the initial setup. For our purposes, we will consider this process part of the preprocessing, and will reserve the term "training" to refer explicitly to the training of the neural network itself. Once these databases have been built, either from publicly available corpora or from past personal emails, the neural network can be trained.

To implement the supervised learning approach, we require access to an adequate number of properly classified messages. From this extended corpus, we will extract a minimal number of features. For each message, we will determine a spam score and a ham score. Each word in the message is compared to both databases. If a particular word appears in a database, the frequency of the word (stored in the hash table of that database) is added to the corresponding spam or ham score for this message. Therefore, the spam score for a given message is directly related to the frequency with which the message's words appear in the spam database. The same is true for the ham
score for a particular message. Using only the spam score and the ham score of each email, we seek to provide effective classification of the messages.

Objectives
We attempt to show that the spam filtering problem can be simplified into a two-dimensional classification problem, requiring only the spam and ham word frequencies for each message. We seek to show that a neural network can perform the non-linear partitioning of this feature space. The purpose of this paper is not only to show that a neural network is feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims.

Methods & Procedures
The first step in any spam filtering process is the acquisition of an adequately sized corpus. Although several corpora are publicly available, the majority of these packages do not indicate which emails are spam and which are legitimate messages [4]. Many other compilations include a large number of spam messages, but do not include any ham whatsoever. For this reason, the ham corpus used consisted primarily of personal emails collected over the last few months. The spam corpus used is freely and publicly available, with over 500 spam messages included in the creation of our databases.

The ham corpus contained only ham messages in the emlx format, requiring a conversion utility. A small program, written in AppleScript, was used to convert the messages to UTF-8, the standard format recommended by the Internet Mail Consortium (IMC) [5]. Once both sets of messages were successfully stored in this common format, we proceeded.

To create and maintain the hash tables necessary to store the respective ham and spam word counts, we modified an open source Perl script. The script performed the database creation, saving the hash tables in long-term storage. All of the spam messages were combined into a single file, as were all of the ham messages.
We proceeded by calling the script with each file, signifying the appropriate classification in each case. A total of 500 spam messages were placed in the database, resulting in 699,100 unique words. Similarly, a total of 282 ham messages resulted in the storage of 294,531 unique words. This process is shown in Appendix A.

After the hash tables were built, we again modified the Perl script to allow a list of files to be passed as an argument, together with a flag signifying the appropriate classification. For each file in the list, we calculated the spam and ham scores, as described in the previous section. These values, together with a boolean flag signifying the proper classification, were written to an output file. This task completed the preprocessing portion of the project.
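The scoring step can be sketched in a few lines. The project used a modified Perl script; the Python below is only an illustration, and the table contents, function names, whitespace tokenization, and CSV output layout are assumptions.

```python
import csv
import io


def score_message(words, spam_table, ham_table):
    """Sum the stored frequencies of a message's words in each table.

    Words that are absent from a table contribute nothing to that score.
    """
    spam_score = sum(spam_table.get(w, 0) for w in words)
    ham_score = sum(ham_table.get(w, 0) for w in words)
    return spam_score, ham_score


def write_features(messages, is_spam, spam_table, ham_table, out):
    """Write one row per message: spam score, ham score, class flag (1 = spam)."""
    writer = csv.writer(out)
    for text in messages:
        spam_score, ham_score = score_message(
            text.lower().split(), spam_table, ham_table)
        writer.writerow([spam_score, ham_score, int(is_spam)])


# Toy word-count tables standing in for the real databases.
SPAM = {"cheap": 3, "offer": 1, "now": 2}
HAM = {"meeting": 4, "now": 1}

buffer = io.StringIO()  # stands in for the output file
write_features(["Cheap offer now"], True, SPAM, HAM, buffer)
```

Each row carries exactly the two features and the known classification, which is all the network needs as a training or test example.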
Using only two values as input (spam and ham scores), we expect the plot of these points not to be linearly separable. This characteristic can be seen in the generalized Venn diagram, and is further exemplified by observing the overlap present in the plot of the actual input values:

[Figure: generalized Venn diagram of the Spam and Ham classes, and plot "Distribution" of the input values, ham score against spam score]

To ensure correct classification, the type of network chosen must be capable of non-linear separation. For this reason, a feed-forward neural network was chosen, using the backpropagation algorithm for training. To implement the neural network, the software package EasyNN-plus was used. The output file from the preprocessing stage was used as input to the network, after the ordering of the data was randomized. As an initial approach, a single hidden layer was used. The program was allowed to specify the most appropriate number of nodes in the hidden layer, resulting in a total of five nodes. Thus, the network became a feed-forward network with two inputs and a single hidden layer of five nodes. A visualization of the network topology is given in Appendix B.
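For readers without access to EasyNN-plus, the kind of network described above can be sketched in plain Python. This is a generic backpropagation implementation, not the internals of EasyNN-plus: the sigmoid activation, the single output node, the weight initialization, and the assumption that both scores are rescaled to the range [0, 1] before training are all choices made for the illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """Feed-forward network: n_in inputs, one hidden layer, one sigmoid output."""

    def __init__(self, n_in=2, n_hidden=5, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds n_in weights followed by a bias.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                       for _ in range(n_hidden)]
        # The output row holds n_hidden weights followed by a bias.
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def _hidden_activations(self, x):
        # zip stops at the shorter sequence, so the bias is added separately.
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                for row in self.hidden]

    def forward(self, x):
        h = self._hidden_activations(x)
        return sigmoid(sum(w * v for w, v in zip(self.output, h))
                       + self.output[-1])

    def train_step(self, x, target, lr):
        """One backpropagation update for a single example (no momentum)."""
        h = self._hidden_activations(x)
        y = sigmoid(sum(w * v for w, v in zip(self.output, h))
                    + self.output[-1])
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights from before this update.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= lr * delta_out * hj
        self.output[-1] -= lr * delta_out
        for j, row in enumerate(self.hidden):
            for i, xi in enumerate(x):
                row[i] -= lr * delta_hidden[j] * xi
            row[-1] -= lr * delta_hidden[j]
        return 0.5 * (y - target) ** 2  # squared error before the update


def train(net, examples, lr=0.8, epochs=2000, seed=0):
    """Online training over (inputs, target) pairs, reshuffled every epoch."""
    rng = random.Random(seed)
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, target in examples:
            net.train_step(x, target, lr)
```

A call such as `train(TinyNetwork(), rows)` with `rows` holding `([spam_score, ham_score], label)` pairs would fit the two-input, five-hidden-node layout. The hidden layer is what allows a curved decision boundary; with no hidden layer, the model could only draw a straight line through the score plane.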
For the training, the learning rate was initially set to 0.80 and allowed both to decay and to be optimized. The momentum was set to zero, as there was no ordering of the data. We specified that the first 400 values should be used for training. We reserved 71 values for validation purposes, leaving 300 values for testing. Training took place over 27,916 epochs, and took approximately 45 seconds. The network evaluated the validation examples correctly after only several hundred epochs; however, training was continued until all errors were below 0.5%. Although the error could be further reduced, the algorithm was stopped at this point to prevent overtraining. Once the training had completed, the network was queried against the testing data, and the results were output to a file for further analysis.

Results & Discussion
The neural network approach to spam filtering shows promising results. On the test set, all messages were properly classified. There were neither false positives nor false negatives, an ideal situation. To generate these results, five nodes were required in the hidden layer. We can infer from this fact that the partitioning of the messages in the feature space is not trivial, and is non-linear. The network correctly classified the entire corpus of messages, supporting the plausibility of such a technique. Some of this information can be seen in Appendix C.

Although we were able to perform the classification on the given test set, a larger corpus needs to be tested to confirm the accuracy of the system. It also remains to be shown that the weights can be modified easily to accommodate new input, so that the network retains high precision as message content evolves over time. This neural network filtering approach is not proposed as a stand-alone solution.
Other commonplace techniques, such as blacklists, should first be used to evaluate candidate messages in order to improve the effectiveness of the overall system [6]. In this paper, we only consider a message corpus consisting of preprocessed messages; that is, messages that have already undergone initial filtering techniques. We offer this approach as a replacement for the Bayesian filter commonly used as a component of an entire filtering system. It should also be noted that the proposed neural network does not perform feature extraction; rather, we depend on a separate preprocessing component to perform this task. In this paper, we perform the feature extraction somewhat manually, although this process could easily be performed automatically by a separate unit.
References
[1] Author Unknown, About FAR, FRR, and EER, Human Scan, [Online Document], Mar. 2004, [cited 2008 April 17].
[2] A. Obied, Bayesian Spam Filtering, Department of Computer Science, University of Calgary.
[3] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists, Information Retrieval, vol. 6.
[4] G. Cormack, T. Lynam, Spam Track Guidelines - TREC, [Online Document], Jul. 2007, [cited 2008 April 17].
[5] P. Hoffman, Display of Internationalized Mail Addresses Through Address Mapping, Internet Mail Consortium, [Online Document], Feb. 2003, [cited 2008 April 17].
[6] J. Kong, B. Rezaei, N. Sarshar, V. Roychowdhury, P. Boykin, Collaborative Spam Filtering Using E-Mail Networks, Computer, vol. 39, no. 8, Aug. 2006.
Appendix A - Building Hash Tables
Appendix B - Network Topology
Appendix C - Analysis of Results
More informationIn this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.
December 13, 2006 IS256: Applied Natural Language Processing Final Project Email classification for semi-automated reply generation HANNES HESSE mail 2056 Emerson Street Berkeley, CA 94703 phone 1 (510)
More informationA System to Automatically Index Genealogical Microfilm Titleboards Introduction Preprocessing Method Identification
A System to Automatically Index Genealogical Microfilm Titleboards Samuel James Pinson, Mark Pinson and William Barrett Department of Computer Science Brigham Young University Introduction Millions of
More informationHandwritten Text Recognition
Handwritten Text Recognition M.J. Castro-Bleda, S. España-Boquera, F. Zamora-Martínez Universidad Politécnica de Valencia Spain Avignon, 9 December 2010 Text recognition () Avignon Avignon, 9 December
More informationMass Classification Method in Mammogram Using Fuzzy K-Nearest Neighbour Equality
Mass Classification Method in Mammogram Using Fuzzy K-Nearest Neighbour Equality Abstract: Mass classification of objects is an important area of research and application in a variety of fields. In this
More informationA modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems
A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University
More informationData Preprocessing. Supervised Learning
Supervised Learning Regression Given the value of an input X, the output Y belongs to the set of real values R. The goal is to predict output accurately for a new input. The predictions or outputs y are
More information6.034 Quiz 2, Spring 2005
6.034 Quiz 2, Spring 2005 Open Book, Open Notes Name: Problem 1 (13 pts) 2 (8 pts) 3 (7 pts) 4 (9 pts) 5 (8 pts) 6 (16 pts) 7 (15 pts) 8 (12 pts) 9 (12 pts) Total (100 pts) Score 1 1 Decision Trees (13
More informationAutomatic Creation of Digital Fast Adder Circuits by Means of Genetic Programming
1 Automatic Creation of Digital Fast Adder Circuits by Means of Genetic Programming Karim Nassar Lockheed Martin Missiles and Space 1111 Lockheed Martin Way Sunnyvale, CA 94089 Karim.Nassar@lmco.com 408-742-9915
More informationMODULE 7 Nearest Neighbour Classifier and its variants LESSON 11. Nearest Neighbour Classifier. Keywords: K Neighbours, Weighted, Nearest Neighbour
MODULE 7 Nearest Neighbour Classifier and its variants LESSON 11 Nearest Neighbour Classifier Keywords: K Neighbours, Weighted, Nearest Neighbour 1 Nearest neighbour classifiers This is amongst the simplest
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More informationCANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.
CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr. Michael Nechyba 1. Abstract The objective of this project is to apply well known
More informationPerceptron-Based Oblique Tree (P-BOT)
Perceptron-Based Oblique Tree (P-BOT) Ben Axelrod Stephen Campos John Envarli G.I.T. G.I.T. G.I.T. baxelrod@cc.gatech sjcampos@cc.gatech envarli@cc.gatech Abstract Decision trees are simple and fast data
More informationRita McCue University of California, Santa Cruz 12/7/09
Rita McCue University of California, Santa Cruz 12/7/09 1 Introduction 2 Naïve Bayes Algorithms 3 Support Vector Machines and SVMLib 4 Comparative Results 5 Conclusions 6 Further References Support Vector
More informationCS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp
CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as
More informationList of Exercises: Data Mining 1 December 12th, 2015
List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring
More informationEnd-To-End Spam Classification With Neural Networks
End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationA New Online Clustering Approach for Data in Arbitrary Shaped Clusters
A New Online Clustering Approach for Data in Arbitrary Shaped Clusters Richard Hyde, Plamen Angelov Data Science Group, School of Computing and Communications Lancaster University Lancaster, LA1 4WA, UK
More informationChapter 6 Evaluation Metrics and Evaluation
Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationVisual object classification by sparse convolutional neural networks
Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.
More informationSender Reputation Filtering
This chapter contains the following sections: Overview of, on page 1 SenderBase Reputation Service, on page 1 Editing Score Thresholds for a Listener, on page 4 Entering Low SBRS Scores in the Message
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More information2 OVERVIEW OF RELATED WORK
Utsushi SAKAI Jun OGATA This paper presents a pedestrian detection system based on the fusion of sensors for LIDAR and convolutional neural network based image classification. By using LIDAR our method
More informationEvaluating Classifiers
Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with
More informationMIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018
MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge
More informationFighting Spam, Phishing and Malware With Recurrent Pattern Detection
Fighting Spam, Phishing and Malware With Recurrent Pattern Detection White Paper September 2017 www.cyren.com 1 White Paper September 2017 Fighting Spam, Phishing and Malware With Recurrent Pattern Detection
More information