The Comparative Study of Machine Learning Algorithms in Text Data Classification*
Wang Xin
School of Science, Beijing Information Science and Technology University, Beijing, China

Abstract: Classification is one of the most important research fields in data mining. Accurate classification of text data is an important basis for information processing and text retrieval technology, and it has a wide range of applications. Traditional text categorization models based on knowledge engineering and expert systems lack flexibility, so it is of great theoretical and practical significance to study the performance of machine learning algorithms in text categorization. In this paper, spam filtering methods using several machine learning algorithms are discussed. The Python language was applied to classify the text data, and the performance of different algorithms in text classification was compared: the polynomial (multinomial) model of the naive Bayesian algorithm, the Bernoulli model of the naive Bayesian algorithm, the support vector machine algorithm, and the K-nearest neighbor algorithm. In order to filter out noise words that appear frequently in the text but carry no valid information, we applied the Chi-square test to reduce the feature dimension, which improved both the classification performance and the running speed of the classifiers. Furthermore, the accuracy, recall rate, and F1-score of the four algorithms under different feature dimensions were compared. The numerical example showed that the support vector machine algorithm had higher accuracy in text categorization but ran slowly, while the naive Bayesian algorithm was simple and fast and showed an obvious advantage in speed.

Keywords: naive Bayesian algorithm; text classification; support vector machine; K-nearest neighbor; Chi-square test

I. INTRODUCTION

With the rapid development of computer-related technology, the internet and its derivative resources produce huge amounts of text data.
How to classify text data logically according to the needs of consultation, storage, and application has become an increasingly important issue. Therefore, data mining and classification technology based on text content has gradually become a focus of attention. The function of a text classification algorithm is to determine the category of a text according to some of its characteristics and a category label set given in advance. The traditional text classification method is based on knowledge engineering and expert systems and has great defects in flexibility and classification effect; it is less and less suited to the demands of increasingly complicated text data classification systems. Since the 1990s, the application of machine learning to text classification has received extensive attention [1-7]. A variety of machine learning algorithms have been widely used in text classification research, such as decision trees, support vector machines, the naive Bayesian algorithm, the K-nearest neighbor algorithm, boosting algorithms, and random forests. The general procedure of a text classification algorithm is that the system summarizes the regularities of classification from sample data whose classes are known and establishes discrimination rules; when a new text is encountered, its class is determined according to the identified rules. That is, automatic text classification constructs classifiers through supervised learning, so as to categorize new texts automatically. This paper first introduces the classification rules and validation process of the naive Bayesian algorithm, the support vector machine, and the K-nearest neighbor algorithm; it then uses a spam classification example to compare the performance and running speed of the naive Bayesian polynomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm, and the K-nearest neighbor algorithm.
II. ALGORITHM CLASSIFICATION RULES

Suppose the input space $\mathcal{X} \subseteq \mathbf{R}^n$ is the set of $n$-dimensional vectors, and the output space $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$ is the set of classes. The input is a feature vector $x \in \mathcal{X}$, and the output variable $Y$ is a class label.

A. Naive Bayesian Algorithm

The naive Bayesian algorithm is a common classification algorithm; it is easy to implement and efficient in both learning and prediction. Let $X$ be a random vector defined on the input space, let $Y$ be a random variable defined on the output space, and let $P(X, Y)$ be the joint probability distribution of $X$ and $Y$. The training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is drawn independently and identically from $P(X, Y)$. The algorithm learns the joint probability distribution of the training data set under the assumption of conditional independence of the features. For a given input vector $x$, the posterior probability of each class is calculated by Bayes' theorem, and the class with the largest posterior probability is taken as the output.

Wang xin volume III issue xi nov 2017 Page 42
$$y = \arg\max_{c_k} P(Y = c_k \mid X = x) = \arg\max_{c_k} \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k)\, P(Y = c_k)}, \qquad k = 1, 2, \ldots, K$$

B. K-Nearest Neighbor Algorithm

The K-nearest neighbor algorithm is one of the simplest machine learning algorithms. Its basic idea is to find the K samples nearest to the input instance in the training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, and to classify the input instance $x$ as the class with the largest proportion among those K samples. The common distance measure is the $L_p$ distance

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p},$$

where $x_i^{(l)}$ is the $l$-th component of the vector $x_i$ and $p \ge 1$; $L_p$ is the Euclidean distance when $p = 2$. When K equals the number N of training samples, any input instance is classified as the class with the largest proportion in the whole training set. When K = 1, the input instance $x$ is classified as the class of its nearest neighbor. In general, a smaller and more appropriate value of K is chosen by cross-validation.

C. Support Vector Machine Algorithm

The support vector machine (SVM) is a commonly used machine learning algorithm. It was originally proposed for the two-class problem and, after many years of research, has also been applied to multi-class problems. Its basic principle is to find the maximum-margin separating hyperplane in the feature space, which divides the samples of the training data set into two categories. Furthermore, by minimizing the structural risk, it minimizes both the empirical risk and the confidence interval, improving the generalization ability of the learner; it can obtain good statistical rules even when the sample size is small [8]. For a training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $y_i \in \{-1, +1\}$, the following optimization problem is constructed and solved:

$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, N$$

The optimal solution $(w^*, b^*)$ is then obtained, and the classification decision function can be expressed as

$$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$$
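As an illustration of the naive Bayesian decision rule above, a minimal multinomial model with Laplace smoothing can be sketched in pure Python. The function and variable names here are hypothetical, not taken from the paper's code:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log P(Y=c) and log P(word | Y=c) with Laplace smoothing.
    docs is a list of token lists; labels is the class of each document."""
    classes = set(labels)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    vocab = {w for doc in docs for w in doc}
    loglik = {}
    for c in classes:
        total = sum(word_counts[c].values())
        loglik[c] = {w: math.log((word_counts[c][w] + alpha) /
                                 (total + alpha * len(vocab)))
                     for w in vocab}
    return prior, loglik

def predict(doc, prior, loglik):
    """y = argmax_c [ log P(Y=c) + sum over words of log P(w | Y=c) ].
    Words unseen in training are ignored (a deliberate simplification)."""
    scores = {c: prior[c] + sum(loglik[c].get(w, 0.0) for w in doc)
              for c in prior}
    return max(scores, key=scores.get)
```

Because the posterior is only compared across classes, the shared denominator of Bayes' theorem is dropped and log-probabilities are summed to avoid underflow.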
D. Evaluation Indexes of the Algorithm

For classification algorithms, especially two-class algorithms, evaluation indexes such as the accuracy (precision), the recall rate, and the comprehensive evaluation index (F-Measure) are commonly used. The total numbers of the four prediction outcomes of the classifier on the test data set are recorded respectively as:

TP: the number of positive instances classified as the positive class;
FN: the number of positive instances classified as the negative class;
FP: the number of negative instances classified as the positive class;
TN: the number of negative instances classified as the negative class.

The accuracy (precision) indicates the proportion of correctly classified positives among all instances predicted to be positive:

$$P = \frac{TP}{TP + FP}$$

The recall rate indicates the proportion of correctly classified positives among all instances that are actually positive:

$$R = \frac{TP}{TP + FN}$$

The comprehensive evaluation index is the weighted harmonic mean of accuracy and recall:

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}$$

For simplicity, it is generally advisable to take $\beta = 1$, namely the F1-Measure. The confusion matrix is often used to observe the classification results of the classifier visually, and is laid out as follows:

TABLE I. THE CONFUSION MATRIX

                          Predicted positive    Predicted negative
True result: positive     TP                    FN
True result: negative     FP                    TN

III. CONSTRUCTION OF THE CLASSIFIER

In order to evaluate the performance of the model better, it is necessary to validate it. Before training the model, the total data set is divided into a training data set and a test data set, to reduce the error brought by simple validation. In this paper we use 10-fold cross-validation: the data set is divided into ten subsets randomly, and a total of ten tests are done for the classification model. In each of the ten testing processes, 9 of the 10 subsets are used as the training set, and the remaining subset is used as the test set.
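Concretely, the three indexes defined in Section II.D can be computed directly from the four counts; a small helper, with hypothetical names, following the formulas above:

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Precision P = TP/(TP+FP), recall R = TP/(TP+FN),
    F_beta = (1+beta^2)PR / (beta^2*P + R).
    tn is not needed for these three indexes but completes the
    confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    return precision, recall, f_beta
```

With beta = 1 this reduces to the F1-Measure, the harmonic mean of precision and recall.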
The accuracy, recall rate, and F1-score are calculated for each run, and the average value of each index over the 10 test results is taken as the evaluation index of model performance. In practice, the K-fold method of the model selection module in the Python third-party library scikit-learn is used to conduct the 10-fold cross-validation. Further, the metrics module in scikit-learn is used to form the confusion matrix and analyze the performance of the classifier model.

In the process of text classification, a very serious problem is that taking the words of the text as features results in the curse of dimensionality when the sample size is too large, and consequently the training speed becomes too slow. It is therefore necessary to reduce the feature dimension, which also enhances the accuracy of the algorithm. The dimensionality reduction in this paper is implemented mainly by the Chi-square method, which is in the feature selection module of the Python third-party library scikit-learn. The processes of validation and application of the algorithm are shown in Figure 1 and Figure 2, respectively.

Fig. 1. Algorithm validation process (raw data set; regular-expression filtering and Chinese text segmentation; vectorization with a stop-word list into a data matrix; matrix dimension reduction; 10-fold cross-validation over training and test data sets; training with the classification algorithms, classification of the test sets, calculation of the evaluation indexes, and model evaluation)

Fig. 2. Algorithm application process (the training data set is filtered, segmented, vectorized, and dimension-reduced to form a classifier; a prediction sample is filtered, segmented, and vectorized into a prediction matrix, and the classifier outputs the classification result)
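The paper performs the 10-fold procedure with scikit-learn's model selection module; purely to illustrate the mechanics described above, a plain-Python sketch with hypothetical function names:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition the indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_and_score, data, labels, k=10):
    """Average a scoring function over k train/test splits.
    train_and_score(train_X, train_y, test_X, test_y) returns one score."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [j for j in range(len(data)) if j not in test_set]
        scores.append(train_and_score(
            [data[j] for j in train_idx], [labels[j] for j in train_idx],
            [data[j] for j in test_idx], [labels[j] for j in test_idx]))
    return sum(scores) / k
```

Each of the k runs trains on k-1 folds and scores on the held-out fold, and the mean score is the cross-validated estimate of model performance.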
A. Numerical Example

In recent years, E-mail has replaced traditional mail as a tool for people's daily communication because of its advantages of simplicity, convenience, fast propagation, and wide dissemination. However, according to the 2016 spam and phishing attacks report of Kaspersky Lab [9], about 20% of spam E-mails spread ransomware Trojans. Spam E-mails not only occupy memory space but also mix in commercial advertising, fraud information, and even viruses, which seriously affect people's lives. Therefore, we use spam E-mail discrimination as the example for the text classification algorithms.

The numerical example uses E-mails as the training data. Because the format of an E-mail is complex, we filter the Chinese characters with the Python regular-expression library. Because spam mail is sent repeatedly, there may be duplicate items, so 7062 non-duplicate E-mails are retained in the data set after deleting the duplicates. Word segmentation for each E-mail is done with a word-segmentation library, and the key words are extracted to form the feature matrix. Part of the feature matrix is shown in Table II.

TABLE II. THE FEATURE MATRIX (only the column headers are recoverable from the extraction; they are the feature words integration, demand, form, software, delete, consult, meeting, and sponsor)
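Assuming the segmented E-mails are available as lists of tokens, the vectorization step that produces a feature matrix like Table II can be sketched as a bag-of-words count matrix (hypothetical names; the paper's own pipeline uses scikit-learn):

```python
from collections import Counter

def build_feature_matrix(token_docs):
    """Bag-of-words count matrix: one row per document, one column
    per vocabulary term, entries are term counts."""
    vocab = sorted({w for doc in token_docs for w in doc})
    matrix = []
    for doc in token_docs:
        counts = Counter(doc)
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix
```

Each key word becomes one dimension of the feature space, which is why the dimension grows with the corpus and motivates the Chi-square reduction described earlier.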
According to the algorithm application process shown in Fig. 2, the naive Bayesian polynomial model, the naive Bayesian Bernoulli model, the support vector machine algorithm, and the K-nearest neighbor algorithm are applied to the training data set, and the performances of the four algorithms are compared. The confusion matrices of the four algorithms are shown in Fig. 3 to Fig. 6.

Fig. 3. The confusion matrix of the naive Bayesian polynomial model

From Fig. 3, the accuracy of the naive Bayesian polynomial model on this training data set is 0.963, the recall rate is 0.961, and the F1-measure is 0.962.

Fig. 4. The confusion matrix of the naive Bayesian Bernoulli model

From Fig. 4, the accuracy of the naive Bayesian Bernoulli model on this training data set is 0.948, the recall rate is 0.964, and the F1-measure is 0.956.

Fig. 5. The confusion matrix of the support vector machine algorithm

From Fig. 5, the accuracy of the SVM algorithm on this training data set is 0.978, the recall rate is 0.975, and the F1-measure is 0.977.

Fig. 6. The confusion matrix of the K-nearest neighbor algorithm

From Fig. 6, the accuracy of the K-nearest neighbor algorithm on this training data set is 0.967, the recall rate is 0.882, and the F1-measure is 0.923.

The training times of the four algorithms are shown in Table III.

TABLE III. THE REQUIRED TRAINING TIME FOR THE FOUR ALGORITHMS

Algorithm                           Required time (seconds)
naive Bayesian polynomial model     6
naive Bayesian Bernoulli model      11
support vector machine              780
K-nearest neighbor                  447

According to the classification results, the three performance indexes of the SVM algorithm are better than those of the other models, but its training time is too long, which gives a bad user experience for the larger amounts of data found in practical applications.
The naive Bayesian polynomial model has higher precision, lower recall rate, higher F1-measure, and shorter training time than the naive Bayesian Bernoulli model, so the polynomial model is the better of the two. Moreover, there is a large gap on every index between the K-nearest neighbor algorithm and the other algorithms.
B. Comparison of Algorithms in Different Dimensions

The Chi-square test dimensionality-reduction method has a very good statistical foundation. It is a widely used hypothesis-testing method, often used to test the correlation between two categorical variables. The basic idea is to calculate the Chi-square value between the theoretical (expected) value and the observed value: the smaller the Chi-square value, the smaller the deviation between the observed value and the theoretical value; otherwise, the deviation between the observed value and the theoretical value is greater.

In Fig. 7 to Fig. 9, the horizontal coordinate represents the feature dimension, and the vertical coordinates represent the accuracy, the recall rate, and the F1-measure, respectively. It can be seen that the naive Bayesian polynomial model, the naive Bayesian Bernoulli model, and the support vector machine algorithm are almost unaffected by the dimension size and have stable performance. The K-nearest neighbor algorithm is sensitive to the dimension size, which may be because it is affected by the K value in different dimensions. In this paper, K = 4 and a fixed dimension are chosen, which is relatively stable and has a better effect.

Fig. 9. The F1-measure of the four algorithms under different dimensions

IV. CONCLUSION

Of the two models of the naive Bayesian algorithm, the polynomial model has higher accuracy and faster speed in text classification. Spam classification should pay more attention to precision, ensuring that users can still receive normal mail while spam is filtered out effectively. The naive Bayesian algorithm has a simple training process; although it is less accurate than SVM, it is much faster, and when the amount of data is huge this speed advantage is particularly obvious. The Chi-square test can help the naive Bayesian algorithm reduce the dimensionality and improve the performance of the algorithm.
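For a single feature word in a two-class problem, the Chi-square value described above reduces to the standard 2x2 contingency-table formula; a sketch with hypothetical names (the paper itself uses the Chi-square scorer in scikit-learn's feature selection module over all features at once):

```python
def chi2_term_class(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.
    n11: positive-class documents containing the term,
    n10: negative-class documents containing the term,
    n01: positive-class documents without the term,
    n00: negative-class documents without the term.
    Uses chi2 = N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den
```

A term distributed independently of the class gives a value near 0 and can be discarded as noise; the highest-scoring terms are kept as the reduced feature set.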
It can achieve the effect of filtering out noisy words and meaningless words, and can make the algorithm more effective.

Fig. 7. The accuracy of the four algorithms under different dimensions

Fig. 8. The recall rate of the four algorithms under different dimensions

ACKNOWLEDGMENT

This research was supported by the National Natural Science Foundation of China ( ) and the Qin Xin Talents Cultivation Program (QXTCP B201705), Beijing Information Science & Technology University.

REFERENCES

[1] Y. Yang. An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, vol. 1, pp. , .
[2] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys, vol. 34, pp. 1-47, .
[3] J. Sun, J. Xiao. Study on feedback learning of SVM-based Chinese text classification, Control and Decision, vol. 19, pp. , August .
[4] H. Kim, P. Howland, H. Park. Dimension reduction in text classification with support vector machines, Journal of Machine Learning Research, vol. 6, pp. , .
[5] Z. Yang. Research on text classification algorithms based on machine learning, University of Guangxi, .
[6] X. Zhang. Review of machine learning in automatic text categorization, Journal of the China Society for Scientific and Technical Information, vol. 25, pp. , December .
[7] J. Lai. Simulation research of text categorization based on data mining, Computer Simulation, vol. 28, pp. , December .
[8] H. Li. Statistical Learning Method, Beijing: Tsinghua University Press, .
[9] [ ].
More informationCSE 158. Web Mining and Recommender Systems. Midterm recap
CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158
More informationCS6375: Machine Learning Gautam Kunapuli. Mid-Term Review
Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes
More informationPredicting Popular Xbox games based on Search Queries of Users
1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which
More informationVideo Inter-frame Forgery Identification Based on Optical Flow Consistency
Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong
More informationAn Empirical Study on Lazy Multilabel Classification Algorithms
An Empirical Study on Lazy Multilabel Classification Algorithms Eleftherios Spyromitros, Grigorios Tsoumakas and Ioannis Vlahavas Machine Learning & Knowledge Discovery Group Department of Informatics
More informationClassification. Slide sources:
Classification Slide sources: Gideon Dror, Academic College of TA Yaffo Nathan Ifill, Leicester MA4102 Data Mining and Neural Networks Andrew Moore, CMU : http://www.cs.cmu.edu/~awm/tutorials 1 Outline
More informationThe Design and Implementation of Disaster Recovery in Dual-active Cloud Center
International Conference on Information Sciences, Machinery, Materials and Energy (ICISMME 2015) The Design and Implementation of Disaster Recovery in Dual-active Cloud Center Xiao Chen 1, a, Longjun Zhang
More informationPreface to the Second Edition. Preface to the First Edition. 1 Introduction 1
Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches
More informationUnsupervised Feature Selection for Sparse Data
Unsupervised Feature Selection for Sparse Data Artur Ferreira 1,3 Mário Figueiredo 2,3 1- Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL 2- Instituto Superior Técnico, Lisboa, PORTUGAL 3-
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 20: 10/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationLarge Scale Data Analysis Using Deep Learning
Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting
More informationData mining: concepts and algorithms
Data mining: concepts and algorithms Practice Data mining Objective Exploit data mining algorithms to analyze a real dataset using the RapidMiner machine learning tool. The practice session is organized
More informationR (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.
Assignment No. 4 Title: SD Module- Data Science with R Program R (2) C (4) V (2) T (2) Total (10) Dated Sign Data analysis case study using R for readily available data set using any one machine learning
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationUniversity of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka
Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should
More informationINF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering
INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised
More informationUsing Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions
Using Real-valued Meta Classifiers to Integrate and Contextualize Binding Site Predictions Offer Sharabi, Yi Sun, Mark Robinson, Rod Adams, Rene te Boekhorst, Alistair G. Rust, Neil Davey University of
More informationA Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics
A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au
More informationLecture 9: Support Vector Machines
Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationData Preprocessing. Data Preprocessing
Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should
More informationComputational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions
Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................
More informationResearch on Design and Application of Computer Database Quality Evaluation Model
Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationAn Immune Concentration Based Virus Detection Approach Using Particle Swarm Optimization
An Immune Concentration Based Virus Detection Approach Using Particle Swarm Optimization Wei Wang 1,2, Pengtao Zhang 1,2, and Ying Tan 1,2 1 Key Laboratory of Machine Perception, Ministry of Eduction,
More informationA Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression
Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study
More informationMulti-label classification using rule-based classifier systems
Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar
More informationMetrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?
Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to
More informationResearch on outlier intrusion detection technologybased on data mining
Acta Technica 62 (2017), No. 4A, 635640 c 2017 Institute of Thermomechanics CAS, v.v.i. Research on outlier intrusion detection technologybased on data mining Liang zhu 1, 2 Abstract. With the rapid development
More informationRandom Forest A. Fornaser
Random Forest A. Fornaser alberto.fornaser@unitn.it Sources Lecture 15: decision trees, information theory and random forests, Dr. Richard E. Turner Trees and Random Forests, Adele Cutler, Utah State University
More informationSYS 6021 Linear Statistical Models
SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are
More informationMSA220 - Statistical Learning for Big Data
MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups
More informationSathyamangalam, 2 ( PG Scholar,Department of Computer Science and Engineering,Bannari Amman Institute of Technology, Sathyamangalam,
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 8, Issue 5 (Jan. - Feb. 2013), PP 70-74 Performance Analysis Of Web Page Prediction With Markov Model, Association
More informationKBSVM: KMeans-based SVM for Business Intelligence
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence
More informationBest First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis
Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction
More informationText Classification for Spam Using Naïve Bayesian Classifier
Text Classification for E-mail Spam Using Naïve Bayesian Classifier Priyanka Sao 1, Shilpi Chaubey 2, Sonali Katailiha 3 1,2,3 Assistant ProfessorCSE Dept, Columbia Institute of Engg&Tech, Columbia Institute
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationAn Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid
An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid Demin Wang 2, Hong Zhu 1, and Xin Liu 2 1 College of Computer Science and Technology, Jilin University, Changchun
More informationStudy on Classifiers using Genetic Algorithm and Class based Rules Generation
2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules
More informationIdentifying Important Communications
Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our
More informationChapter 6 Evaluation Metrics and Evaluation
Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific
More informationImprovement of SURF Feature Image Registration Algorithm Based on Cluster Analysis
Sensors & Transducers 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis 1 Xulin LONG, 1,* Qiang CHEN, 2 Xiaoya
More informationData mining techniques for actuaries: an overview
Data mining techniques for actuaries: an overview Emiliano A. Valdez joint work with Banghee So and Guojun Gan University of Connecticut Advances in Predictive Analytics (APA) Conference University of
More information