Bayesian Spam Detection System Using Hybrid Feature Selection Method
|
|
- Antony Neal
- 5 years ago
- Views:
Transcription
1 2016 International Conference on Manufacturing Science and Information Engineering (ICMSIE 2016) ISBN: Bayesian Spam Detection System Using Hybrid Feature Selection Method JUNYING CHEN, SHUNFENG ZHOU and HUAQING MIN ABSTRACT With the rapid development of Internet, the amount of text information has increased dramatically. As such, how to effectively and accurately identify, classify and deal with these information becomes a major challenge. In this paper, we used a term frequency hybrid filter which combines the refined naïve Bayesian classifier and innovative hybrid feature selection method to detect spams. According to our experiment results, we found that the hybrid feature selection method had better spam detection performance than traditional feature selection methods. 1 INTRODUCTION By the end of 2015, the number of Chinese Internet users had broken through 650 million. has become an important method for communicating, gaining information and looking for jobs. However, in recent years, more and more spams have not only affected people s daily work and life, but also brought huge economic loss to the society [1]. Current mainstream spam-blocking method is collecting a large amount of spams and using such spams to train a classifier, so as to get the classifier to work intelligently, which can identify spams among new messages [2-4]. However, spams can attack some widely-used spam filters which use specific spam detection algorithms. Such attacks seriously affect the effectiveness and practicality of current anti-spam technology. So we should improve current antispam technology. In this paper, we put forward a new hybrid feature selection method based on refined naïve Bayesian classification, which is called term frequency hybrid filter. The experiment results demonstrate that such classifier improves performance. 1 Junying Chen, Shunfeng Zhou, Huaqing Min, Guangzhou Key Laboratory of Robotics and Intelligent Software, School of Software Engineering, South China University of Technology, Guangzhou, Guangdong, China,
2 Spam Dataset Features Selection A new mail Classifier Result Figure 1. Spam detection system. SPAM DETECTION SYSTEM DESIGN The spam detection system block diagram is shown in Figure 1. Before the e- mail classification process, pre-processing is required, which switch s into text messages. Then the sentences are split into word list, which is called space vector model. In order to reduce the calculation time and suppress noise, the classifier usually selects part of the word features [5]. Furthermore, dimensionality reduction is performed on the dataset in advance to improve performance. Finally, such trained classifier is used to identify a new and output the classification judge result. Refined Naïve Bayesian Classification Algorithm Description If the feature w i appears in document d, then the probability of document d belongs to class C i, as shown in the following: (1) In this paper, we refined the classifier by also considering the feature w i does not appear in document d: (2) Assume that any two features are independent, then based on naïve Bayesian classification algorithm, document d belongs to class C i if and only if [6]: 387
3 (3) Hybrid Feature Selection Module Huge amount of documents will produce huge feature set, which will cost a long time in training and classifying, and bring in many noises. As a result, a dimensionality reduction method is needed. General dimensionality reduction methods include feature extraction and feature selection. The feature selection methods used in text classifying include term frequency, information gain [7], mutual information[8] and chi-square detection, etc. However, traditional feature selection algorithms don t help to improve the classifier performance much. In this paper, we put forward a new hybrid feature selection method, which is called term frequency hybrid filter. Firstly, all feature words are sorted by frequencies. Then we can set the information gain, mutual information, chi-square detection or their combinations as the filter selection feature. If one word s index is more than k in the word list which is sorted by filter feature selection algorithm, filter it out and continue to choose; or select this word as a component of the feature set. Generally, k is 40%, 50% or 60% of the total amount of features, depending on the actual dataset. This hybrid feature selection method considers the classifying ability of the high-frequency words, but filters the high-frequency words with low classifying ability. Therefore, the term frequency hybrid filter combines the advantages of term frequency method and other feature selection algorithms. EXPERIMENTS AND RESULTS The dataset is collected from user mailbox, consisting of totally 811 mails, including 490 spams and 321 non-spams. Each mail had deleted the attachments, and left the theme, sender address, main body and attachment file names.10-fold cross-validation was performed on arbitrary dataset, and the result was the average of the 10 tests. Recall rate and F1 score were used as evaluation measurements, which were widely used in machine learning algorithm evaluations. F1 score considered both the correct and complete identification capabilities of the algorithms, while the recall rate was related to the misjudgment possibility. Word frequency, information gain, mutual information, chi-square detection and three hybrid feature selection combinations were used on dataset classification. The three hybrid feature selection combinations respectively use information gain, mutual information and chi-square detection method as the filter feature selection method to sort all features, and select first 50% words as 388
4 feature set components. After applying feature selection methods, the refined naïve Bayesian classifier was used to classify the dataset, and the evaluations was conducted, as shown in the Table I. As shown in Table I, mutual information method had the highest recall rate, but its F1 score was too low to use in normal cases. Hybrid feature selection combination I had a good balance in recall rate and F1 score. By applying hybrid feature selection method, useless high-frequency words were intelligently filtered out, so the performance of the naïve Bayesian classifier was improved. TABLE I. The performance of different features selection methods. Features selection method Recall rate F1score Term frequency (first 400 words) Information gain (first 1500 words) Mutual information (first 1000 words) Chi-square detection(first 400 words) Hybrid features selection combination I Hybrid features selection combination II Hybrid features selection combination III CONCLUSIONS In this paper, we refined the naïve Bayesian classifier,increasing its spam detection correctness. When applying appropriate hybrid feature selection method, as investigated in this paper, not only the classifier's detection performance can be improved, but also the computational complexity can be reduced. The experiments described in this paper demonstrated that our refined naïve Bayesian classifier combined with hybrid feature selection method can fulfill our everyday spam detection requirements. ACKNOWLEDGEMENTS This work is supported by Guangzhou Science and Technology Program (Key Laboratory Project, No ) and the Fundamental Research Funds for the Central Universities (No. 2015ZM081). 389
5 REFERENCES 1. Kanich, C., et al Spamalytics: An empirical analysis of spam marketing conversion R, 15th ACM Conference on Computer and Communications Security, Alpaydin, E Introduction to machine learning, MIT press, pp Harrington, P Machine learning in action M, Manning Publications Co.pp Hearst, M. A., Dumais S. T., Osman E., et al Support Vector Machines, IEEE J. Intelligent Systems and their Applications, 13(4),pp Guyon, I. and Elisseeff, A. An introduction to variable and feature selection, The Journal of Machine Learning Research, 3,pp Androutsopoulos, I., Koutsias J., Chandrinos K.V., et al An evaluation of naive bayesian anti-spam filtering C, Workshop on Machine Learning in the New Information Age, Kent, J. T. Information gain and a general measure of correlation, Biometrika, 70(1), pp Fraser, A. M. and Swinney, H. L. Independent coordinates for strange attractors from mutual information, Physical review A, 33(2), pp
The Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationSTUDYING OF CLASSIFYING CHINESE SMS MESSAGES
STUDYING OF CLASSIFYING CHINESE SMS MESSAGES BASED ON BAYESIAN CLASSIFICATION 1 LI FENG, 2 LI JIGANG 1,2 Computer Science Department, DongHua University, Shanghai, China E-mail: 1 Lifeng@dhu.edu.cn, 2
More informationA Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics
A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au
More informationDiscovering Advertisement Links by Using URL Text
017 3rd International Conference on Computational Systems and Communications (ICCSC 017) Discovering Advertisement Links by Using URL Text Jing-Shan Xu1, a, Peng Chang, b,* and Yong-Zheng Zhang, c 1 School
More informationAn Empirical Performance Comparison of Machine Learning Methods for Spam Categorization
An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University
More informationKeywords : Bayesian, classification, tokens, text, probability, keywords. GJCST-C Classification: E.5
Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 13 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global
More informationAn Adaptive Threshold LBP Algorithm for Face Recognition
An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationSpam UF. Use and customization instructions for the Barracuda Spam service at the University of Florida.
Spam Quarantine @ UF Use and customization instructions for the Barracuda Spam service at the University of Florida. Graff, Randy A 10/10/2008 Contents Overview... 2 Getting Started... 2 Actions... 2 Whitelist/Blacklist...
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationApplication of Improved Lzc Algorithm in the Discrimination of Photo and Text ChengJing Ye 1, a, Donghai Zeng 2,b
2016 International Conference on Information Engineering and Communications Technology (IECT 2016) ISBN: 978-1-60595-375-5 Application of Improved Lzc Algorithm in the Discrimination of Photo and Text
More informationA Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence
2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da
More informationCS6375: Machine Learning Gautam Kunapuli. Mid-Term Review
Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes
More informationCollaborative Spam Mail Filtering Model Design
I.J. Education and Management Engineering, 2013, 2, 66-71 Published Online February 2013 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijeme.2013.02.11 Available online at http://www.mecs-press.net/ijeme
More informationImproved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG *
2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5
More informationRETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu
More informationFast or furious? - User analysis of SF Express Inc
CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood
More informationFeature weighting classification algorithm in the application of text data processing research
, pp.41-47 http://dx.doi.org/10.14257/astl.2016.134.07 Feature weighting classification algorithm in the application of text data research Zhou Chengyi University of Science and Technology Liaoning, Anshan,
More informationTime Series Clustering Ensemble Algorithm Based on Locality Preserving Projection
Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology
More informationIdentifying Important Communications
Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our
More informationForward Feature Selection Using Residual Mutual Information
Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics
More informationA Reputation-based Collaborative Approach for Spam Filtering
Available online at www.sciencedirect.com ScienceDirect AASRI Procedia 5 (2013 ) 220 227 2013 AASRI Conference on Parallel and Distributed Computing Systems A Reputation-based Collaborative Approach for
More informationSpam Classification Documentation
Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:
More informationUsage Guide to Handling of Bayesian Class Data
CAMELOT Security 2005 Page: 1 Usage Guide to Handling of Bayesian Class Data 1. Basics Classification of textual data became much more importance in the actual time. Reason for that is the strong increase
More informationClassification and Summarization: A Machine Learning Approach
Email Classification and Summarization: A Machine Learning Approach Taiwo Ayodele Rinat Khusainov David Ndzi Department of Electronics and Computer Engineering University of Portsmouth, United Kingdom
More informationFiltering Spam Using Fuzzy Expert System 1 Hodeidah University, Faculty of computer science and engineering, Yemen 3, 4
Filtering Spam Using Fuzzy Expert System 1 Siham A. M. Almasan, 2 Wadeea A. A. Qaid, 3 Ahmed Khalid, 4 Ibrahim A. A. Alqubati 1, 2 Hodeidah University, Faculty of computer science and engineering, Yemen
More informationOUR TOP DATA SOURCES AND WHY THEY MATTER
OUR TOP DATA SOURCES AND WHY THEY MATTER TABLE OF CONTENTS INTRODUCTION 2 MAINSTREAM WEB 3 MAJOR SOCIAL NETWORKS 4 AUDIENCE DATA 5 VIDEO 6 FOREIGN SOCIAL NETWORKS 7 SYNTHESIO DATA COVERAGE 8 1 INTRODUCTION
More informationA Survey And Comparative Analysis Of Data
A Survey And Comparative Analysis Of Data Mining Techniques For Network Intrusion Detection Systems In Information Security, intrusion detection is the act of detecting actions that attempt to In 11th
More informationDiscriminate Analysis
Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive
More informationCopyright 2011 please consult the authors
Alsaleh, Slah, Nayak, Richi, Xu, Yue, & Chen, Lin (2011) Improving matching process in social network using implicit and explicit user information. In: Proceedings of the Asia-Pacific Web Conference (APWeb
More informationKarami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.
Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review
More informationApplications of Machine Learning on Keyword Extraction of Large Datasets
Applications of Machine Learning on Keyword Extraction of Large Datasets 1 2 Meng Yan my259@stanford.edu 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationUsing PageRank in Feature Selection
Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important
More informationAnalysis on the technology improvement of the library network information retrieval efficiency
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):2198-2202 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Analysis on the technology improvement of the
More informationA Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
More informationTRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK
TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.
More informationResearch on Design and Application of Computer Database Quality Evaluation Model
Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the
More information1. Access to Chinese Academic Journal Web
1. Access to Chinese Academic Journal Web Access to Chinese Academic Journal Web is via East View Publications s home page at http://china.eastivew.com, then click on the link China Academic Journals.
More informationThe Impact of Information System Risk Management on the Frequency and Intensity of Security Incidents Original Scientific Paper
The Impact of Information System Risk Management on the Frequency and Intensity of Security Incidents Original Scientific Paper Hrvoje Očevčić Addiko Bank d.d. Slavonska avenija 6, Zagreb, Croatia hrvoje.ocevcic@gmail.com
More informationApplication of Redundant Backup Technology in Network Security
2018 2nd International Conference on Systems, Computing, and Applications (SYSTCA 2018) Application of Redundant Backup Technology in Network Security Shuwen Deng1, Siping Hu*, 1, Dianhua Wang1, Limin
More informationFeature-weighted k-nearest Neighbor Classifier
Proceedings of the 27 IEEE Symposium on Foundations of Computational Intelligence (FOCI 27) Feature-weighted k-nearest Neighbor Classifier Diego P. Vivencio vivencio@comp.uf scar.br Estevam R. Hruschka
More informationCORE for Anti-Spam. - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow
CORE for Anti-Spam - Innovative Spam Protection - Mastering the challenge of spam today with the technology of tomorrow Contents 1 Spam Defense An Overview... 3 1.1 Efficient Spam Protection Procedure...
More informationFuzzy Entropy based feature selection for classification of hyperspectral data
Fuzzy Entropy based feature selection for classification of hyperspectral data Mahesh Pal Department of Civil Engineering NIT Kurukshetra, 136119 mpce_pal@yahoo.co.uk Abstract: This paper proposes to use
More informationText Classification for Spam Using Naïve Bayesian Classifier
Text Classification for E-mail Spam Using Naïve Bayesian Classifier Priyanka Sao 1, Shilpi Chaubey 2, Sonali Katailiha 3 1,2,3 Assistant ProfessorCSE Dept, Columbia Institute of Engg&Tech, Columbia Institute
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More information10601 Machine Learning. Model and feature selection
10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior
More informationA NEW HYBRID APPROACH FOR NETWORK TRAFFIC CLASSIFICATION USING SVM AND NAÏVE BAYES ALGORITHM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationEmpirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee
A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) Empirical risk minimization (ERM) Recall the definitions of risk/empirical risk We observe the
More informationArtificial Intelligence. Programming Styles
Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to
More informationBayesian Spam Filtering Using Statistical Data Compression
Global Journal of researches in engineering Numerical Methods Volume 11 Issue 7 Version 1.0 December 2011 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
More informationNon-ML Anti-Spamming: A Role Based Solution
Non-ML Anti-Spamming: A Role Based Solution Anthony Y. Fu, Email: anthony@cs.cityu.edu.hk WebPage: http://www.cs.cityu.edu.hk/~anthony Department of Computer Science, City University of Hong Kong Hong
More informationComparing Univariate and Multivariate Decision Trees *
Comparing Univariate and Multivariate Decision Trees * Olcay Taner Yıldız, Ethem Alpaydın Department of Computer Engineering Boğaziçi University, 80815 İstanbul Turkey yildizol@cmpe.boun.edu.tr, alpaydin@boun.edu.tr
More informationClassifying Twitter Data in Multiple Classes Based On Sentiment Class Labels
Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels Richa Jain 1, Namrata Sharma 2 1M.Tech Scholar, Department of CSE, Sushila Devi Bansal College of Engineering, Indore (M.P.),
More informationApplying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help?
Applying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help? Olivier Bousquet, Google, Zürich, obousquet@google.com June 4th, 2007 Outline 1 Introduction 2 Features 3 Minimax
More informationK-means clustering based filter feature selection on high dimensional data
International Journal of Advances in Intelligent Informatics ISSN: 2442-6571 Vol 2, No 1, March 2016, pp. 38-45 38 K-means clustering based filter feature selection on high dimensional data Dewi Pramudi
More informationLinear Discriminant Analysis in Ottoman Alphabet Character Recognition
Linear Discriminant Analysis in Ottoman Alphabet Character Recognition ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas /
More informationPerformance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM
Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Lu Chen and Yuan Hang PERFORMANCE DEGRADATION ASSESSMENT AND FAULT DIAGNOSIS OF BEARING BASED ON EMD AND PCA-SOM.
More information2. Design Methodology
Content-aware Email Multiclass Classification Categorize Emails According to Senders Liwei Wang, Li Du s Abstract People nowadays are overwhelmed by tons of coming emails everyday at work or in their daily
More informationRecord Linkage using Probabilistic Methods and Data Mining Techniques
Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University
More informationBuilding Classifiers using Bayesian Networks
Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance
More informationTemperature Calculation of Pellet Rotary Kiln Based on Texture
Intelligent Control and Automation, 2017, 8, 67-74 http://www.scirp.org/journal/ica ISSN Online: 2153-0661 ISSN Print: 2153-0653 Temperature Calculation of Pellet Rotary Kiln Based on Texture Chunli Lin,
More informationA PSO-based Generic Classifier Design and Weka Implementation Study
International Forum on Mechanical, Control and Automation (IFMCA 16) A PSO-based Generic Classifier Design and Weka Implementation Study Hui HU1, a Xiaodong MAO1, b Qin XI1, c 1 School of Economics and
More informationReconstruction-based Classification Rule Hiding through Controlled Data Modification
Reconstruction-based Classification Rule Hiding through Controlled Data Modification Aliki Katsarou, Aris Gkoulalas-Divanis, and Vassilios S. Verykios Abstract In this paper, we propose a reconstruction
More informationVisual Analysis of Lagrangian Particle Data from Combustion Simulations
Visual Analysis of Lagrangian Particle Data from Combustion Simulations Hongfeng Yu Sandia National Laboratories, CA Ultrascale Visualization Workshop, SC11 Nov 13 2011, Seattle, WA Joint work with Jishang
More informationUsing PageRank in Feature Selection
Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important
More informationLearning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li
Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,
More informationModel-based segmentation and recognition from range data
Model-based segmentation and recognition from range data Jan Boehm Institute for Photogrammetry Universität Stuttgart Germany Keywords: range image, segmentation, object recognition, CAD ABSTRACT This
More informationVisoLink: A User-Centric Social Relationship Mining
VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.
More informationResearch Domain Selection using Naive Bayes Classification
I.J. Mathematical Sciences and Computing, 2016, 2, 14-23 Published Online April 2016 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijmsc.2016.02.02 Available online at http://www.mecs-press.net/ijmsc
More informationSchematizing a Global SPAM Indicative Probability
Schematizing a Global SPAM Indicative Probability NIKOLAOS KORFIATIS MARIOS POULOS SOZON PAPAVLASSOPOULOS Department of Management Science and Technology Athens University of Economics and Business Athens,
More informationMetric and Identification of Spatial Objects Based on Data Fields
Proceedings of the 8th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences Shanghai, P. R. China, June 25-27, 2008, pp. 368-375 Metric and Identification
More informationThe Applicability of the Perturbation Model-based Privacy Preserving Data Mining for Real-world Data
The Applicability of the Perturbation Model-based Privacy Preserving Data Mining for Real-world Data Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham Computer Science Department University of Texas
More informationBUAA AUDR at ImageCLEF 2012 Photo Annotation Task
BUAA AUDR at ImageCLEF 2012 Photo Annotation Task Lei Huang, Yang Liu State Key Laboratory of Software Development Enviroment, Beihang University, 100191 Beijing, China huanglei@nlsde.buaa.edu.cn liuyang@nlsde.buaa.edu.cn
More informationCONTENTS IN DETAIL PART I AN INTRODUCTION TO SPAM FILTERING INTRODUCTION 1 THE HISTORY OF SPAM 3 2 HISTORICAL APPROACHES TO FIGHTING SPAM 25
CONTENTS IN DETAIL INTRODUCTION xvii PART I AN INTRODUCTION TO SPAM FILTERING 1 THE HISTORY OF SPAM 3 The Definition of Spam... 4 The Very First Spam... 4 Spam: The Early Years... 7 Jay-Jay s College Fund...
More informationNaïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others
Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict
More informationAn Empirical Study of Lazy Multilabel Classification Algorithms
An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
More informationInformation Retrieval
Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio) Relevance
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationComment Extraction from Blog Posts and Its Applications to Opinion Mining
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
More informationA Feature Selection Method to Handle Imbalanced Data in Text Classification
A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University
More informationFeature Ranking Using Linear SVM
JMLR: Workshop and Conference Proceedings 3: 53-64 WCCI2008 workshop on causality Feature Ranking Using Linear SVM Yin-Wen Chang Chih-Jen Lin Department of Computer Science, National Taiwan University
More informationA Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition
A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es
More informationCombination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset
International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran
More informationOn Effective Classification via Neural Networks
On Effective E-mail Classification via Neural Networks Bin Cui 1, Anirban Mondal 2, Jialie Shen 3, Gao Cong 4, and Kian-Lee Tan 1 1 Singapore-MIT Alliance, National University of Singapore {cuibin, tankl}@comp.nus.edu.sg
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationSpam Management with PureMessage
Email Spam Management with PureMessage The University has implemented a new spam management system that offers several improvements over the previous service. Not only does PureMessage detect spam more
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationCOMP61011 Foundations of Machine Learning. Feature Selection
OMP61011 Foundations of Machine Learning Feature Selection Pattern Recognition: The Early Days Only 200 papers in the world! I wish! Pattern Recognition: The Early Days Using eight very simple measurements
More informationAnalytical Support of Financial Footnotes Analysis
Analytical Support of Financial Footnotes Analysis XBRL Conference Maryam Heidari Maryam.heidari@bwl.tu-freiberg.de 01.06.2016 Prof. Dr. Carsten Felden Technische Universität Bergakademie Freiberg (Sachsen)
More informationPrairie View A&M University Managing your s. Office of Information Resource Management
Prairie View A&M University Managing your Emails Why Manage Emails? Its important to manage your emails because: It makes it easier to retrieve information. Lowers the storage costs associated with keeping
More informationLayer by Layer: Protecting from Attack in Office 365
Layer by Layer: Protecting Email from Attack in Office 365 Office 365 is the world s most popular office productivity suite, with user numbers expected to surpass 100 million in 2017. With the vast amount
More informationVECTOR SPACE CLASSIFICATION
VECTOR SPACE CLASSIFICATION Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. Chapter 14 Wei Wei wwei@idi.ntnu.no Lecture
More informationCS 5540 Spring 2013 Assignment 3, v1.0 Due: Apr. 24th 11:59PM
1 Introduction In this programming project, we are going to do a simple image segmentation task. Given a grayscale image with a bright object against a dark background and we are going to do a binary decision
More informationImprovements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao Fan1, Yuexin Wu2,b, Ao Xiao1
3rd International Conference on Machinery, Materials and Information Technology Applications (ICMMITA 2015) Improvements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao
More informationNON-CENTRALIZED DISTINCT L-DIVERSITY
NON-CENTRALIZED DISTINCT L-DIVERSITY Chi Hong Cheong 1, Dan Wu 2, and Man Hon Wong 3 1,3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong {chcheong, mhwong}@cse.cuhk.edu.hk
More informationData Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality
Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing
More informationDecision Science Letters
Decision Science Letters 3 (2014) 439 444 Contents lists available at GrowingScience Decision Science Letters homepage: www.growingscience.com/dsl Identifying spam e-mail messages using an intelligence
More information