Project Report. Prepared for: Dr. Liwen Shih Prepared by: Joseph Hayes. April 17, 2008 Course Number: CSCI


University of Houston Clear Lake
School of Science & Computer Engineering

Project Report

Prepared for: Dr. Liwen Shih
Prepared by: Joseph Hayes
April 17, 2008
Course Number: CSCI

University of Houston Clear Lake - School of Science & Computer Engineering, 2700 Bay Area Blvd, Houston, Texas

Spam Filtering - An Alternative Approach

Table of Contents

I. Abstract - Brief overview of the spam filtering problem statement, approach taken, and results / findings
II. Introduction - Background and history of spam filtering, as well as a brief highlight of proposed techniques
III. Objectives - Goals of the project with regard to techniques and desired results in quantitative terms
IV. Methods and Procedures - In-depth explanation of the methods involved in the spam filtering of the project
V. Results and Discussion - Results of the above techniques and discussion of the outcome
VI. References - Sources referencing the material described in the introduction and other portions of the project
VII. Appendices - Project Poster, Data, etc.

Abstract

Due to the ever-increasing amount of unsolicited e-mail, commonly referred to as spam, techniques have arisen for combating such messages. Although prevalent, Bayesian filters often misclassify legitimate e-mail. We provide a supervised neural network approach to the filtering problem. This technique shows promising results, with zero misclassification on a corpus of moderate size.

Introduction

Effective spam filtering has long been a problem. The difficulty lies in the necessity of very small false rejection rates, as the misclassification of a valid e-mail is rarely tolerable. Because senders of unsolicited e-mail ("spammers") very often disguise their messages to appear as valid correspondence, perfect filtering is impossible. Borrowing terminology from the field of biometrics, it would be possible to calculate the equal error rate of a given test set. From this value, we could produce a system capable of balancing the false rejection rate with the false acceptance rate [1].

However, this approach will typically yield unacceptable results, as false rejections come at a much higher cost to the user: missing an important e-mail usually outweighs the improper acceptance of a particular spam message. Therefore, a virtually non-existent false rejection rate is necessary, so we instead focus on reducing the false acceptance rate as much as possible while maintaining a minimal false rejection rate.
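To make the biometric terminology concrete, the following is a minimal Python sketch, not part of the original project, of how the false rejection rate (FRR), the false acceptance rate (FAR), and an approximate equal error rate could be estimated from per-message scores. The score lists, the convention that a higher score means "more spam-like", and the function names are assumptions made purely for illustration.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (FRR, FAR) when messages scoring >= threshold are rejected as spam."""
    # False rejection: a legitimate (ham) message is rejected.
    frr = sum(s >= threshold for s in ham_scores) / len(ham_scores)
    # False acceptance: a spam message is accepted.
    far = sum(s < threshold for s in spam_scores) / len(spam_scores)
    return frr, far


def approximate_eer(ham_scores, spam_scores):
    """Try each observed score as a threshold; keep the one where FRR and FAR are closest."""
    best = None
    for t in sorted(set(ham_scores) | set(spam_scores)):
        frr, far = error_rates(ham_scores, spam_scores, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, t, frr, far)
    _, threshold, frr, far = best
    return threshold, frr, far


if __name__ == "__main__":
    # Hypothetical spam-likeness scores in [0, 1]; not data from this project.
    ham = [0.05, 0.10, 0.20, 0.35, 0.60]
    spam = [0.30, 0.55, 0.70, 0.80, 0.95]
    print(approximate_eer(ham, spam))
```

A filter tuned to the equal error rate treats both kinds of error as equally costly, which is exactly the property the paragraph above argues against for e-mail.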

Traditional spam filters use Bayesian techniques; that is, statistical methods based on Bayes' Theorem. As described by Obied, the probability that an e-mail message belongs to a particular class can be calculated using Bayes' Theorem as follows:

\[ P(C \mid f_1, \ldots, f_n) = \frac{P(C)\, P(f_1, \ldots, f_n \mid C)}{P(f_1, \ldots, f_n)} \]

where each class C represents either the class of spam messages or the class of non-spam messages, and each f_i represents a particular feature used in classifying messages [2]. Most Bayesian spam filters assume no a priori knowledge; initial probabilities are unknown. Through training, these filters can learn to differentiate effectively between spam and non-spam messages.

An alternative approach to Bayesian techniques, as suggested by this paper, involves machine learning through the use of a supervised neural network. Other machine learning techniques have recently been attempted, including variants of k-Nearest-Neighbor (k-NN) classifiers [3]. An advantage of these techniques is that they commonly make use of an adjustable threshold, allowing the sensitivity of the filter to be modified at will. The approach that we present here does not provide such options; rather, we seek to show that our method filters with enough precision to eliminate the need for such adjustments.

For practically any spam filtering scheme, we must build and maintain databases containing the frequency with which specific words typically appear in both spam and ham (legitimate) messages. One problem arises when determining how to efficiently store and search for these words within our databases. A common practice, frequently used in Bayesian filtering, is to maintain two separate hash tables, one for each mail classification (spam and ham). Each entry in the hash tables consists of a key-value pair: a specific word paired with the number of occurrences (for a specific classification) found during the training phase.
Here, the term "training phase" refers to the training phase of Bayesian filters, which may continue even after the initial setup. For our purposes, we will consider this process part of the preprocessing, and will reserve the term "training" to refer explicitly to the training of the neural network itself. Once these databases have been built, either from publicly available corpora or from past personal e-mails, the neural network can be trained.

To implement the supervised learning approach, we require access to an adequate number of properly classified messages. From this extended corpus, we will extract a minimal number of features. For each message, we will determine a spam score and a ham score. Each word in the message is compared to both databases. If a particular word appears in a database, the frequency of the word (stored in the hash table of that database) is added to the corresponding spam or ham score for the message. Therefore, the spam score for a given message is directly related to the frequency with which the message's words appear in the spam database. The same is true for the ham score of a particular message. Using only the spam score and the ham score of each e-mail, we seek to provide effective classification of the messages.

Objectives

We attempt to show that the spam filtering problem can be simplified into a two-dimensional classification problem, requiring only the spam and ham word frequencies for each message. We seek to show that a neural network can perform the non-linear partitioning of this feature space. The purpose of this paper is not only to show that a neural network approach is feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims.

Methods & Procedures

The first step in any spam filtering process is the acquisition of an adequately sized corpus. Although several corpora are publicly available, the majority of these packages do not distinguish which e-mails are spam and which are legitimate messages [4]. Many other compilations include a large number of spam messages, but do not include any ham whatsoever. For this reason, the ham corpus used consisted primarily of personal e-mails collected over the last few months. The spam corpus used is freely and publicly available, with over 500 spam messages included in the creation of our databases. The ham corpus contained only ham messages in the emlx format, requiring a conversion utility. A small program, written in AppleScript, was used to convert the messages to a UTF-8 format, the standard format recommended by the Internet Mail Consortium (IMC) [5]. Once both sets of messages were successfully stored in this common format, we proceeded to build the databases.

To create and maintain the hash tables necessary to store the respective ham and spam word counts, we modified an open source Perl script. The script performed the database creation, saving the hash tables in long-term storage. All of the spam messages were combined into a single file, as were all of the ham messages.
We proceeded by calling the script with each file, signifying the appropriate classification in each case. A total of 500 spam messages were placed in the database, resulting in 699,100 unique words. Similarly, a total of 282 ham messages resulted in the storage of 294,531 unique words. This process is shown in Appendix A.

After the hash tables were built, we again modified the Perl script to allow a list of files to be passed as an argument, together with a flag signifying the appropriate classification. For each file in the list, we calculated the spam and ham scores, as described in the previous section. These values, together with a boolean flag signifying the proper classification, were written to an output file. This task completed the preprocessing portion of the project.
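The preprocessing described above can be sketched roughly as follows. This is an illustrative Python sketch, not the modified Perl script used in the project: the tokenizer, the function names, and the choice to count a repeated word once per occurrence are all assumptions, and the sample messages are invented.

```python
import re
from collections import Counter

# Assumed tokenizer: lower-case runs of letters, digits, and apostrophes.
WORD = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return WORD.findall(text.lower())


def build_table(messages):
    """Count how often each word occurs across all messages of one class."""
    table = Counter()
    for message in messages:
        table.update(tokenize(message))
    return table


def score(message, table):
    """Add up the stored frequency of every message word found in the table."""
    return sum(table[word] for word in tokenize(message) if word in table)


def feature_row(message, spam_table, ham_table, is_spam):
    """Return (spam score, ham score, classification flag) for one message."""
    return score(message, spam_table), score(message, ham_table), is_spam


if __name__ == "__main__":
    # Invented miniature corpora standing in for the real spam and ham files.
    spam_table = build_table(["cheap pills cheap offer", "free offer now"])
    ham_table = build_table(["meeting notes for the project",
                             "project report due now"])
    print(feature_row("free cheap offer now", spam_table, ham_table, True))
    print(feature_row("project meeting now", spam_table, ham_table, False))
```

Each returned row corresponds to one line of the output file: two numeric features and the known classification, which is all the network receives as input.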

Using only two values as input (the spam and ham scores), we do not expect the two classes to be linearly separable. This characteristic can be seen in the generalized Venn diagram shown below, and is further exemplified by the overlap present in the plot of the actual input values:

[Figure: generalized Venn diagram of the spam and ham distributions, followed by a scatter plot of ham score against spam score for the input messages]

To ensure correct classification, the type of network chosen must be capable of non-linear separation. For this reason, a feed-forward neural network was chosen, using the backpropagation algorithm for training. To implement the neural network, the software package EasyNN-plus was used. The output file from the preprocessing stage was used as input to the network, after the ordering of the data was randomized. As an initial approach, a single hidden layer was used. The program was allowed to choose the number of nodes in the hidden layer, resulting in a total of five nodes. Thus, the network topology became a feed-forward network with two inputs and a single hidden layer of five nodes. A visualization of the network topology is given in Appendix B.
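For readers without access to EasyNN-plus, a network of the kind described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: sigmoid activations, a single output node, a squared-error loss, a fixed learning rate, no momentum, and inputs already scaled to the range 0 to 1. None of these details are confirmed by the project, and the class and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """Feed-forward net: 2 inputs -> 5 sigmoid hidden nodes -> 1 sigmoid output."""

    def __init__(self, n_in=2, n_hidden=5, seed=0):
        rng = random.Random(seed)
        # Each weight row ends with a bias term.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, output activation) for one input."""
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h)) + self.output[-1])
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, rate):
        """One backpropagation update on a single example."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights from before this update.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= rate * delta_out * hj
        self.output[-1] -= rate * delta_out
        for j, row in enumerate(self.hidden):
            for i, xi in enumerate(x):
                row[i] -= rate * delta_hidden[j] * xi
            row[-1] -= rate * delta_hidden[j]


def mean_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)


if __name__ == "__main__":
    # Invented (scaled spam score, scaled ham score) -> 1.0 for spam, 0.0 for ham.
    data = [((0.9, 0.1), 1.0), ((0.8, 0.3), 1.0),
            ((0.2, 0.7), 0.0), ((0.1, 0.9), 0.0)]
    net = TinyNetwork()
    for _ in range(2000):
        for x, t in data:
            net.train_step(x, t, rate=0.8)
    print(mean_squared_error(net, data))
```

The hidden layer is what allows a curved decision boundary; with no hidden nodes the same code would reduce to a linear separator of the two scores.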

For the training, the learning rate was initially set to 0.80 and allowed both to decay and to be optimized. The momentum was set to zero, as there was no ordering of the data. We specified that the first 400 values should be used for training. We reserved 71 values for validation purposes, leaving 300 values for testing. Training took place over 27,916 epochs, and took approximately 45 seconds. The network evaluated the validation examples correctly after only several hundred epochs; however, training was continued until all errors were below 0.5%. Although the error could be further reduced, the algorithm was stopped at this point to prevent overtraining. Once the training had completed, the network was queried against the testing data, and the results were output to a file for further analysis.

Results & Discussion

The neural network approach to spam filtering shows promising results. On the test set, all messages were properly classified. There were neither false positives nor false negatives, an ideal situation. To generate these results, five nodes were required in the hidden layer. We can infer from this fact that the partitioning of the messages in the feature space is not a trivial feat, and is definitively non-linear. The network effectively classified the entire corpus of messages, verifying the plausibility of such a technique. Some of this information can be seen in Appendix C.

Although we were able to perform the classification on the given test set, a larger corpus needs to be tested to ensure the accuracy of the system. It also remains to be shown that the weights can be modified easily to accommodate new input, so that the network retains high precision as message content evolves over time.

This neural network filtering approach is not suggested to be a stand-alone solution.
Other commonplace techniques, such as blacklists, should first be used to evaluate candidate messages in order to improve the effectiveness of the overall system [6]. In this paper, we consider only a corpus of preprocessed messages, that is, messages that have already undergone initial filtering techniques. We offer this approach as a replacement for the Bayesian filter commonly used as a component of an entire filtering system. Also, it should be noted that the proposed neural network does not perform feature extraction; rather, we depend on a separate preprocessing component to perform this task. In this paper, we perform the feature extraction somewhat manually, although this process could easily be performed automatically by a separate unit.
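The partitioning of the data and the error counts reported in this section can be sketched as follows. This Python fragment is an illustration only: the function names are hypothetical, the default sizes simply mirror the 400 / 71 / 300 partition reported above, and the convention that a false positive is a ham message labelled as spam is an assumption.

```python
import random


def split_rows(rows, n_train=400, n_validate=71, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test


def count_errors(predicted, actual):
    """Return (false positives, false negatives); True means spam."""
    # False positive: predicted spam, actually ham.
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    # False negative: predicted ham, actually spam.
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    return false_pos, false_neg


if __name__ == "__main__":
    rows = list(range(771))  # stand-in for the preprocessed feature rows
    train, validate, test = split_rows(rows)
    print(len(train), len(validate), len(test))
    print(count_errors([True, False, True, False], [True, True, False, False]))
```

Reporting the two error counts separately, rather than a single accuracy figure, keeps the costly case (a legitimate message rejected) visible when a larger corpus is tested.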

References

[1] Author Unknown, "About FAR, FRR, and EER," Human Scan, [Online Document], Mar. 2004, [cited 2008 April 17].
[2] A. Obied, "Bayesian Spam Filtering," Department of Computer Science, University of Calgary.
[3] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists," Information Retrieval, vol. 6.
[4] G. Cormack and T. Lynam, "Spam Track Guidelines - TREC," [Online Document], Jul. 2007, [cited 2008 April 17].
[5] P. Hoffman, "Display of Internationalized Mail Addresses Through Address Mapping," Internet Mail Consortium, [Online Document], Feb. 2003, [cited 2008 April 17].
[6] J. Kong, B. Rezaei, N. Sarshar, V. Roychowdhury, and P. Boykin, "Collaborative Spam Filtering Using E-Mail Networks," Computer, vol. 39, no. 8, Aug. 2006.

Appendix A - Building Hash Tables

Appendix B - Network Topology

Appendix C - Analysis of Results

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

Spam Classification Documentation

Spam Classification Documentation Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR SPAMMING

INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR  SPAMMING INCORPORTING KEYWORD-BASED FILTERING TO DOCUMENT CLASSIFICATION FOR EMAIL SPAMMING TAK-LAM WONG, KAI-ON CHOW, FRANZ WONG Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue,

More information

Collaborative Spam Mail Filtering Model Design

Collaborative Spam Mail Filtering Model Design I.J. Education and Management Engineering, 2013, 2, 66-71 Published Online February 2013 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijeme.2013.02.11 Available online at http://www.mecs-press.net/ijeme

More information

Overview of the TREC 2005 Spam Track. Gordon V. Cormack Thomas R. Lynam. 18 November 2005

Overview of the TREC 2005 Spam Track. Gordon V. Cormack Thomas R. Lynam. 18 November 2005 Overview of the TREC 2005 Spam Track Gordon V. Cormack Thomas R. Lynam 18 November 2005 To answer questions! Why Standardized Evaluation? Is spam filtering a viable approach? What are the risks, costs,

More information

Mailspike. Henrique Aparício

Mailspike. Henrique Aparício Mailspike Henrique Aparício 1 Introduction For many years now, email has become a tool of great importance as a means of communication. Its growing use led inevitably to its exploitation by entities that

More information

Accuracy Analysis of Neural Networks in removal of unsolicited s

Accuracy Analysis of Neural Networks in removal of unsolicited  s Accuracy Analysis of Neural Networks in removal of unsolicited e-mails P.Mohan Kumar P.Kumaresan S.Yokesh Babu Assistant Professor (Senior) Assistant Professor Assistant Professor (Senior) SITE SITE SCSE

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Detecting Spam with Artificial Neural Networks

Detecting Spam with Artificial Neural Networks Detecting Spam with Artificial Neural Networks Andrew Edstrom University of Wisconsin - Madison Abstract This is my final project for CS 539. In this project, I demonstrate the suitability of neural networks

More information

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Neural Networks Classifier Introduction INPUT: classification data, i.e. it contains an classification (class) attribute. WE also say that the class

More information

A Reputation-based Collaborative Approach for Spam Filtering

A Reputation-based Collaborative Approach for Spam Filtering Available online at www.sciencedirect.com ScienceDirect AASRI Procedia 5 (2013 ) 220 227 2013 AASRI Conference on Parallel and Distributed Computing Systems A Reputation-based Collaborative Approach for

More information

Keywords : Bayesian, classification, tokens, text, probability, keywords. GJCST-C Classification: E.5

Keywords : Bayesian,  classification, tokens, text, probability, keywords. GJCST-C Classification: E.5 Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 13 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Supplemental Material: Multi-Class Open Set Recognition Using Probability of Inclusion

Supplemental Material: Multi-Class Open Set Recognition Using Probability of Inclusion Supplemental Material: Multi-Class Open Set Recognition Using Probability of Inclusion Lalit P. Jain, Walter J. Scheirer,2, and Terrance E. Boult,3 University of Colorado Colorado Springs 2 Harvard University

More information

Probabilistic Anti-Spam Filtering with Dimensionality Reduction

Probabilistic Anti-Spam Filtering with Dimensionality Reduction Probabilistic Anti-Spam Filtering with Dimensionality Reduction ABSTRACT One of the biggest problems of e-mail communication is the massive spam message delivery Everyday billion of unwanted messages are

More information

CSI5387: Data Mining Project

CSI5387: Data Mining Project CSI5387: Data Mining Project Terri Oda April 14, 2008 1 Introduction Web pages have become more like applications that documents. Not only do they provide dynamic content, they also allow users to play

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

Decision Science Letters

Decision Science Letters Decision Science Letters 3 (2014) 439 444 Contents lists available at GrowingScience Decision Science Letters homepage: www.growingscience.com/dsl Identifying spam e-mail messages using an intelligence

More information

Bayesian Spam Detection

Bayesian Spam Detection Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional

More information

2. On classification and related tasks

2. On classification and related tasks 2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

More information

Spam Detection ECE 539 Fall 2013 Ethan Grefe. For Public Use

Spam Detection ECE 539 Fall 2013 Ethan Grefe. For Public Use Email Detection ECE 539 Fall 2013 Ethan Grefe For Public Use Introduction email is sent out in large quantities every day. This results in email inboxes being filled with unwanted and inappropriate messages.

More information

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity.

Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity. Cake and Grief Counseling Will be Available: Using Artificial Intelligence for Forensics Without Jeopardizing Humanity Jesse Kornblum Outline Introduction Artificial Intelligence Spam Detection Clustering

More information

To earn the extra credit, one of the following has to hold true. Please circle and sign.

To earn the extra credit, one of the following has to hold true. Please circle and sign. CS 188 Spring 2011 Introduction to Artificial Intelligence Practice Final Exam To earn the extra credit, one of the following has to hold true. Please circle and sign. A I spent 3 or more hours on the

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES

AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES AUTOMATED STUDENT S ATTENDANCE ENTERING SYSTEM BY ELIMINATING FORGE SIGNATURES K. P. M. L. P. Weerasinghe 149235H Faculty of Information Technology University of Moratuwa June 2017 AUTOMATED STUDENT S

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

Schematizing a Global SPAM Indicative Probability

Schematizing a Global SPAM Indicative Probability Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

CHEAP, efficient and easy to use, has become an

CHEAP, efficient and easy to use,  has become an Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 A Multi-Resolution-Concentration Based Feature Construction Approach for Spam Filtering Guyue Mi,

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

A generalized additive neural network application in information security

A generalized additive neural network application in information security Lecture Notes in Management Science (2014) Vol. 6: 58 64 6 th International Conference on Applied Operational Research, Proceedings Tadbir Operational Research Group Ltd. All rights reserved. www.tadbir.ca

More information

Constructively Learning a Near-Minimal Neural Network Architecture

Constructively Learning a Near-Minimal Neural Network Architecture Constructively Learning a Near-Minimal Neural Network Architecture Justin Fletcher and Zoran ObradoviC Abetract- Rather than iteratively manually examining a variety of pre-specified architectures, a constructive

More information

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM ISSN: 2229-6956(ONLINE) DOI: 1.21917/ijsc.212.5 ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and

More information

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES

STUDYING OF CLASSIFYING CHINESE SMS MESSAGES STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2

More information

DATA MINING TEST 2 INSTRUCTIONS: this test consists of 4 questions you may attempt all questions. maximum marks = 100 bonus marks available = 10

DATA MINING TEST 2 INSTRUCTIONS: this test consists of 4 questions you may attempt all questions. maximum marks = 100 bonus marks available = 10 COMP717, Data Mining with R, Test Two, Tuesday the 28 th of May, 2013, 8h30-11h30 1 DATA MINING TEST 2 INSTRUCTIONS: this test consists of 4 questions you may attempt all questions. maximum marks = 100

More information

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set

More information

List of figures List of tables Acknowledgements

List of figures List of tables Acknowledgements List of figures List of tables Acknowledgements page xii xiv xvi Introduction 1 Set-theoretic approaches in the social sciences 1 Qualitative as a set-theoretic approach and technique 8 Variants of QCA

More information

EVALUATION OF THE HIGHEST PROBABILITY SVM NEAREST NEIGHBOR CLASSIFIER WITH VARIABLE RELATIVE ERROR COST

EVALUATION OF THE HIGHEST PROBABILITY SVM NEAREST NEIGHBOR CLASSIFIER WITH VARIABLE RELATIVE ERROR COST EVALUATION OF THE HIGHEST PROBABILITY SVM NEAREST NEIGHBOR CLASSIFIER WITH VARIABLE RELATIVE ERROR COST Enrico Blanzieri and Anton Bryl May 2007 Technical Report # DIT-07-025 Evaluation of the Highest

More information

Fingerprint Feature Extraction Based Discrete Cosine Transformation (DCT)

Fingerprint Feature Extraction Based Discrete Cosine Transformation (DCT) Fingerprint Feature Extraction Based Discrete Cosine Transformation (DCT) Abstract- Fingerprint identification and verification are one of the most significant and reliable identification methods. It is

More information

Chapter-8. Conclusion and Future Scope

Chapter-8. Conclusion and Future Scope Chapter-8 Conclusion and Future Scope This thesis has addressed the problem of Spam E-mails. In this work a Framework has been proposed. The proposed framework consists of the three pillars which are Legislative

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

Handwritten Text Recognition

Handwritten Text Recognition Handwritten Text Recognition M.J. Castro-Bleda, Joan Pasto Universidad Politécnica de Valencia Spain Zaragoza, March 2012 Text recognition () TRABHCI Zaragoza, March 2012 1 / 1 The problem: Handwriting

More information

Spam Filtering Using Visual Features

Spam Filtering Using Visual Features Spam Filtering Using Visual Features Sirnam Swetha Computer Science Engineering sirnam.swetha@research.iiit.ac.in Sharvani Chandu Electronics and Communication Engineering sharvani.chandu@students.iiit.ac.in

More information
