NON-PARAMETRIC SPAM FILTERING BASED ON knn AND LSA. Preslav I. Nakov, Panayot M. Dobrikov

MATHEMATICS AND EDUCATION IN MATHEMATICS, 2004
Proceedings of the Thirty-Third Spring Conference of the Union of Bulgarian Mathematicians, Borovets, April 1-4, 2004

Abstract. The paper proposes a non-parametric approach to the filtering of unsolicited commercial e-mail messages, also known as spam. The text of each e-mail message is represented as an LSA vector, which is then fed into a kNN classifier. The method shows high accuracy on a collection of recent personal e-mail messages. Tests on the standard LINGSPAM collection achieve an accuracy of over 99.65%, which is an improvement on the best published results to date.

Introduction. The amount of unsolicited commercial e-mail (also known as spam) has grown tremendously during the last few years and today represents the majority of Internet traffic. Although spam is normally easy to recognise and delete, doing so on an everyday basis is inconvenient. Once an e-mail address has entered the widespread spam distribution lists, it can become almost unusable unless some automated measures are taken. It is now largely recognised that the constantly changing form and content of spam requires filtering with machine learning (ML) techniques, which allow automated training on up-to-date representative collections. The potential of learning was first demonstrated by Sahami et al. [1], who used a Naïve Bayesian classifier. Several other researchers have tried this since, and specialised collections have been created for training ML algorithms. The most famous one, LINGSPAM (a set of e-mail messages from the Linguist List), is accepted as a standard set for evaluating the potential of different approaches [2].

Categorization with kNN and LSA. We used the k-nearest-neighbour (kNN) classifier, which has proved to be among the best performing text categorisation algorithms in many cases [3].
kNN calculates a similarity score between the document to be classified and each of the labelled documents in the training set. When k = 1, the class of the most similar document is selected. Otherwise, the classes of the k closest documents are used, taking their scores into account.

We combined kNN with latent semantic analysis (LSA), a popular technique for the indexing, retrieval and analysis of textual data, which assumes a set of mutual latent dependencies between the terms and the contexts they are used in. This permits LSA to deal successfully with synonymy, and partially with polysemy, which are the major problems of word-based text processing techniques (due to the freedom and variability of expression). LSA is a two-stage process comprising learning and analysis. During the learning phase, a text collection is given, from which a real-valued vector is produced for each term and for each document. The second phase is the analysis, when the proximity between a pair of documents or terms is calculated as the dot product of their normalised LSA vectors [4]. In our experiments, we built an LSA matrix (TF.IDF weighted) from the messages in the training set. The e-mail message to be classified is projected into the LSA space and then compared to each message from the training set. A kNN classifier for a particular value of k then predicts its class.

Experiments and evaluation. We used two document collections:

- LINGSPAM: 2,411 non-spam and 481 spam messages;
- personal collection: 940 non-spam and 575 spam messages; it includes all e-mail messages of one of the authors over a period of 5 consecutive weeks.

While LINGSPAM is largely accepted as the ultimate test collection for spam filtering and allows for comparison between different approaches, it gives no practical idea of real algorithm performance (just like any other frozen test collection). In a situation where spam e-mails are constantly changing, so should the test collections be. A good solution is to use someone's real e-mails for some period of time; in this case it is very important to include all the e-mails, no matter what they contain.

On LINGSPAM we measured the categorization accuracy using stratified 10-fold cross-validation. For the personal collection, though, we adopted a more pragmatic approach:

- Training: all the e-mail messages from the first 4 consecutive weeks (766 non-spam and 453 spam).
- Testing: all e-mails from the subsequent fifth week (174 non-spam and 122 spam).

We performed several experiments using as features: words as met in the text, stems, and the original tokens as identified by the LINGSPAM author. Each of these features was tried with and without stop-word removal.
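The classification stage described above can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: it uses TF.IDF-weighted bag-of-words vectors and cosine proximity directly, omitting the SVD dimensionality-reduction step of LSA for brevity, and the function names are invented for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF.IDF-weighted sparse vectors for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}
    return [project(doc, idf) for doc in docs], idf

def project(doc, idf):
    """Represent one (possibly unseen) document in the training term space."""
    tf = Counter(doc)
    return {t: c * idf.get(t, 0.0) for t, c in tf.items() if idf.get(t, 0.0) > 0}

def cosine(u, v):
    """Dot product of two sparse vectors after length normalisation."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, vectors, labels, k=3):
    """Score-weighted vote among the k most similar training documents."""
    nearest = sorted(((cosine(query, v), y) for v, y in zip(vectors, labels)),
                     reverse=True)[:k]
    votes = Counter()
    for score, label in nearest:
        votes[label] += score
    return votes.most_common(1)[0][0]
```

For k = 1 this reduces to picking the class of the single most similar message, as described above.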
Although stripping the stop-words is beneficial for text categorisation in general, this is not always the case: for example, they are very important features for authorship attribution (which is also a text categorisation task). It was not clear to us whether they are good features for spam filtering, but they were kept in the original LINGSPAM tokenisation, so we felt it necessary to try both cases. We ended up with the following feature sets:

- RAW: raw words, as met in the text;
- RAWNS: like RAW, but with stop-words filtered out;
- STEM: stemmed words (Porter stemmer);
- STEMNS: like STEM, but with stop-words filtered out;
- TOKEN: the original tokenisation as provided with LINGSPAM;
- TOKENNS: like TOKEN, but with stop-words filtered out.

The results from a stratified 10-fold cross-validation on LINGSPAM are shown in Figure 1, where the accuracy is drawn as a function of the number of neighbours k used in kNN. The highest accuracy of over 99.65% is achieved for moderate values of k, such as 3 and 4.
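The stratified 10-fold protocol used above can be sketched as follows. This is a generic illustration, not the paper's code; `classify_fn` is a hypothetical stand-in for the kNN classifier.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Partition example indices into folds that preserve class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)                    # randomise within each class
        for j, i in enumerate(idxs):
            folds[j % n_folds].append(i)     # deal indices out round-robin
    return folds

def cv_accuracy(labels, classify_fn, n_folds=10):
    """Average accuracy; classify_fn(train_indices, test_index) -> label."""
    folds = stratified_folds(labels, n_folds)
    correct = total = 0
    for f, test_fold in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        for i in test_fold:
            correct += classify_fn(train, i) == labels[i]
            total += 1
    return correct / total
```

Stratification matters here because the spam/non-spam classes are unbalanced: each fold keeps roughly the same spam ratio as the whole collection.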

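The word-based feature variants above (RAW, RAWNS, STEM, STEMNS) can be sketched as follows. The stop-word list is illustrative (the paper does not specify the exact list), and a toy suffix stripper stands in for the Porter stemmer.

```python
import re

# Small illustrative stop-word list; an assumption, not the list used in the paper.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "it", "for"}

def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(text, stem=False, remove_stop_words=False):
    """Produce the RAW / RAWNS / STEM / STEMNS variants via the two flags."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem:
        words = [toy_stem(w) for w in words]
    return words
```

The TOKEN and TOKENNS variants would instead reuse the tokenisation shipped with LINGSPAM rather than the regular expression above.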
The LINGSPAM tokenisation is richer than the other representations, since it includes not only words but also some special symbols that could be good features for spam identification. Our experiments, though, indicate this is not the case: in fact, as Figure 1 shows, TOKEN is the worst feature set. All three feature sets TOKEN, RAW and STEM perform much worse than their variants with stop-words removed (TOKENNS, RAWNS and STEMNS).

Discussion. Below we analyse the spam classifier's errors on the personal collection. Some newsletters were classified as spam, although for other people they could be desired messages: e.g. the Intel Developer Services News and the Bravenet News. These examples show a well-known obstacle to spam filtering techniques: a message that is absolutely valid for someone can be spam for somebody else. However, in these two cases there was no other newsletter from these two companies in the training set, hence this is a true error. In general, once one or two issues of a newsletter are marked as spam by the user, the following ones will be classified correctly. A similar example is the following e-mail from Sun's JCP community, which was annotated as spam in the testing set (while the receiver was subscribed) and as non-spam by the classifier:

    From: Stefan Hepper <sthepper@de.ibm.com>
    To: JSR-168-EG@JCP.ORG
    Subject: [SAP.JSR.168] Re: Not too late? GenericPortlet.render()
    Hi Chris, the portal need to communicate the changed title to the portlet
    container, so that the portlet container can return a resource bundle that
    is correct for the current portlet instance. How would otherwise a portlet
    create a dynamic title...

Another misclassified e-mail was an obvious spam (an offer for enlargement of some organ). The reason this happened was that we let the headers of the e-mail be parsed.
In this case, by coincidence, there were many company e-mail addresses from the address book in the CC list, which made the message look similar to other internal company e-mails. Another error was a spam e-mail written in Russian (the only Cyrillic one). The classifier had no similar examples and had to base its decision on the English header (not shown):

    Заберите у меня все, чем я обладаю, но оставьте мне мою речь, и скоро я
    обрету все, что имел Дэниэл Уэбстер Вся наша жизнь строится на общении
    так устроено человеческое общество. Поэтому наибольших успехов в личной
    жизни...

    ("Take from me all that I possess, but leave me my speech, and soon I
    shall regain all that I had." (Daniel Webster) "Our whole life is built
    on communication; that is how human society is arranged. Hence the
    greatest successes in personal life...")

Another ambiguous e-mail:

    Our friendship wont last that long?
    biyv gtdm zh vicd lns kjq alqnynuygbkojpzhpwkgbrmsvm

We now focus on the non-spam messages which were classified as spam. This is a much more important case, since any such miss can invalidate the good results. There were three such e-mails:

Newsletter from ACM:

    Dear ACM TechNews Subscriber: Welcome to the September 15, 2003 edition
    of ACM TechNews, providing timely information for IT professionals three
    times a week. For instructions on how to unsubscribe from this service,
    please see below...

Figure 1. Accuracy on LINGSPAM as a function of the number of neighbours k used in kNN.

Google account confirmation:

    Welcome to Google Accounts. To activate your account and verify your
    e-mail address, please click on the following link:
    http://www.google.com/accounts/ve?c=8685708803317565504. If you have
    received this mail in error, please forward it to
    accounts-noreply@google.com...

Message after a request for a product download:

    Dear Mr. Dobrikov, Thank you for your recent download of WebLogic
    Platform 8.1 for Microsoft Windows (32 bit). Please click here to access
    installation and configuration guides to start you on your way using
    your download...

All three cases exhibit some usual spam properties: phrases of the type "click here" or "download", unsubscribe information, etc., which misled the classifier.

Conclusion and future work. We have shown that simple non-parametric ML methods such as kNN, combined with LSA, achieve state-of-the-art accuracy on spam filtering. These results are encouraging and show the potential of the kNN+LSA combination. Note that we used as features just the raw words or stems from the message text and no other features, such as the subject, the sender's e-mail address, unsubscribe information, capitalisation, formatting, etc., which are important knowledge sources and are used in most popular e-mail filtering systems in operation. It was somewhat surprising to find that exploiting just the e-mail text leads to an improvement on the best results on LINGSPAM. We believe the kNN classifier has great potential and is a suitable candidate for spam filtering, since, unlike parametric approaches (e.g. Bayesian nets), it can learn from only a few (often just one) examples. We plan to try kNN using just non-word features, as well as in combination with LSA.

REFERENCES

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In: Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Technical Report WS-98-05, 1998.
[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, P. Stamatopoulos. Learning to Filter Spam E-mail: A Comparison of a Naive Bayesian and a Memory-Based Approach. Technical Report DEMO 2000/5, NCSR Demokritos, Greece, 2000.
[3] Y. Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1, 1999, No 1/2, 67-88.
[4] T. Landauer, P. Foltz, D. Laham. An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 1998, 259-284.

Preslav Nakov
University of California, Berkeley
EECS, CS Division
Berkeley, CA 94720, USA
e-mail: nakov@cs.berkeley.edu

Panayot Dobrikov
SAP AG, J2EE
Neurottstrasse 16
69190 Walldorf, Germany
e-mail: panayot.dobrikov@sap.com