PREPROCESSING THE FEATURE SELECTION ON MINING ALGORITHM - REVIEW


G. VENKATESWARAN
Assistant Professor, Department of IT & BCA, NPR Arts and Science College, Natham

ABSTRACT
Filtration is the process of refining unwanted data out of a dataset, and it is a major step in data mining: it prepares the dataset so that a learning algorithm can be run on it effectively. Many algorithms can help process the filtered data drawn from the metadata. Data from the UCI data repository can be run through the proposed methods and evaluated. This paper focuses on preprocessing the data and evaluating it. In feature selection, each attribute can be evaluated by a filter-based algorithm; the whole dataset may also be classified using clustering and other specific algorithms. Finally, preprocessing can be used to refine the data from the whole dataset.

KEYWORDS: filtration; preprocessing

INTRODUCTION
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful and ultimately understandable patterns in data. Data mining is the process of extracting information or patterns from large databases, data warehouses, XML repositories, etc. It is also known as one of the core processes of Knowledge Discovery in Databases (KDD). In machine learning and statistics, feature selection is also known as attribute selection or variable subset selection. It is the process of selecting a subset of relevant features for model construction. Feature selection techniques are a subset of the more general field of feature extraction. Feature extraction creates new attributes derived from the original attributes of the dataset, whereas attribute filtering returns a relevant subset of the attributes. Attribute selection methods are generally grouped into four categories: filter, wrapper, embedded and hybrid methods.

Wrapper Method
In the wrapper method, a predictive model is used to score feature subsets. Each new subset is used to train a model, which is then tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset.
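The scoring loop described above can be made concrete. Below is a minimal sketch of wrapper-style greedy forward selection, assuming a scikit-learn style classifier and the UCI Wine data as a stand-in dataset; the function name and the stopping rule are illustrative assumptions, not taken from any of the reviewed papers.

# Hypothetical sketch of a wrapper: greedily grow a feature subset,
# scoring each candidate subset by its error rate on a hold-out set.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def forward_wrapper(X_tr, y_tr, X_ho, y_ho, model):
    selected = []
    remaining = list(range(X_tr.shape[1]))
    best_err = 1.0
    while remaining:
        # the hold-out error rate is the score of each candidate subset
        errs = {}
        for f in remaining:
            cols = selected + [f]
            model.fit(X_tr[:, cols], y_tr)
            errs[f] = 1.0 - model.score(X_ho[:, cols], y_ho)
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:   # stop once no feature lowers the error
            break
        best_err = errs[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_err

X, y = load_wine(return_X_y=True)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)
subset, err = forward_wrapper(X_tr, y_tr, X_ho, y_ho, KNeighborsClassifier(3))
print("selected:", subset, "hold-out error: %.3f" % err)

Because the learner is retrained for every candidate subset, the cost grows quickly with the number of features, which is the computational drawback of wrappers noted below.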

Filter Method
In the filter method, a proxy measure is used instead of the error rate to score a feature subset. This measure is very fast to compute. Filters are less computationally intensive than wrappers, but they produce a feature set that is not tuned to a specific type of predictive model.

Embedded Method
Embedded methods are a catch-all group of techniques that perform feature selection as part of the model construction process. These approaches sit between filters and wrappers in terms of computational complexity. Because an embedded method incorporates feature selection into a given learning algorithm, it can be more efficient than the other methods.

Hybrid Method
Hybrid methods combine the filter and wrapper methods: a filter method is used to reduce the search space that will then be considered by the subsequent wrapper. These methods mainly focus on combining filters and wrappers in order to achieve the best performance of a particular learning algorithm with a time complexity similar to that of the filter methods.

LITERATURE REVIEW:
In paper [1], one of the key problems in data mining, arising in a great variety of fields including pattern recognition and machine learning, is so-called feature selection. It can be defined as finding M relevant features among the N original features. Algorithms that perform feature selection can generally be categorized into two classes: wrappers and filters. Filters treat feature selection as a preprocessing step that is independent of the learning algorithm; in wrappers, feature selection is wrapped around the learning algorithm, and the result of the learning algorithm is used as the evaluation criterion. In general, filters have low time cost but weaker results; on the contrary, the time cost of wrappers is high because they call the learning algorithm to evaluate each candidate feature subset, but the results are better for the predetermined learning algorithm. In recent years, data has become increasingly large in both the number of instances and the number of features. When the number of features is very large, the filter model is usually chosen for its computational efficiency, or filters are applied to reduce the dimensionality of the feature set before wrappers are used. Problems with high-dimensional feature sets are generally characterized by large numbers of features, many irrelevant features, many redundant features, and noisy data.

In paper [2], the development of feature selection has two major directions: filters [8] and wrappers [9]. Filters work fast using a simple measurement, but their results are not always satisfactory. Wrappers, on the other hand, guarantee good results by examining learning results, but they are very slow when applied to wide feature sets that contain hundreds or even thousands of features. Though filters are very efficient at selecting features, they are unstable when performing on wide feature sets. This research incorporates wrappers to deal with the problem. It is not a pure wrapper procedure, but rather a hybrid feature selection model that utilizes both filter and wrapper methods.

In this method, two feature sets are first filtered out by F-score and information gain, respectively. The feature sets are then combined and further tuned by a wrapper procedure, taking advantage of both the filter and the wrapper. It is not as fast as a pure filter, but it can achieve a better result than a filter does. Most importantly, the computational time and complexity can be reduced in comparison to a pure wrapper. The hybrid mechanism is more feasible in real bioinformatics applications, which usually involve a large number of related features. In the experiments, the proposed hybrid feature selection mechanism was applied to the problems of disordered protein prediction [10] and gene selection from microarray cancer data [11]. Effective feature selection is always very helpful.

In paper [3], an important challenge in the classification of high-dimensional data is to design a learning algorithm that can construct an accurate classifier depending on the smallest possible number of attributes. Further, it is often desired that there be realizable guarantees on the future performance of such feature selection approaches; see, for instance, a recent algorithm proposed in [12] involving the identification of a gene subset based on importance ranking and, subsequently, combinations of genes for classification. The traditional methods used for classifying high-dimensional data are often characterized as either filters (e.g., [12], [13]) or wrappers (e.g., [14]), depending on whether the attribute selection is performed independently of, or in conjunction with, the base learning algorithm. The proposed approaches are a step toward more general learning strategies that combine feature selection with the classification algorithm and have tight realizable guarantees.

In paper [4], with the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility [15], [16]. Many feature subset selection methods have been proposed and studied for machine learning applications. They are divided into four broad categories: embedded, wrapper, filter, and hybrid approaches. Embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and may therefore be more efficient than the other three categories [17]. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets; the accuracy of the learning algorithms is usually high, but the generality of the selected features is limited and the computational complexity is large. Filter methods are independent of learning algorithms and have good generality. Hybrid methods are a combination of filter and wrapper methods [18-22], using a filter method to reduce the search space that will be considered by the subsequent wrapper; they mainly focus on achieving the best possible performance of a particular learning algorithm with a time complexity similar to that of the filter methods. Since wrapper methods are computationally expensive, and filter methods, in addition to their generality, are usually a good choice when the number of features is very large, that paper focuses on the filter method.
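As a concrete illustration of the filter measures mentioned so far, the sketch below ranks features by the F-score used in paper [2] for a binary task. The data are synthetic, and the implementation is a plain reading of the commonly cited F-score definition, not code from the paper.

# Hedged sketch of a filter: rank features by F-score (binary labels).
import numpy as np

def f_score(X, y):
    # per-feature separation between the two classes, relative to spread
    pos, neg, mean = X[y == 1], X[y == 0], X.mean(axis=0)
    num = (pos.mean(axis=0) - mean) ** 2 + (neg.mean(axis=0) - mean) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / np.where(den == 0, np.finfo(float).eps, den)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 10))
X[:, 0] += 2.0 * y                 # plant one informative feature
print("ranked features:", np.argsort(f_score(X, y))[::-1])

Because the score never consults a learner, ranking every feature costs a single pass over the data, which is why filters scale to wide feature sets.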
With respect to filter feature selection methods, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms.

In paper [5], feature selection is the process of detecting the relevant features and discarding the irrelevant ones.

A correct selection of the features can lead to an improvement of the inductive learner, either in terms of learning speed, generalization capacity or simplicity of the induced model. There are other benefits associated with a smaller number of features: a reduced measurement cost and, hopefully, a better understanding of the domain. Several situations can hinder the process of feature selection, such as the presence of irrelevant and redundant features, noise in the data, or interaction between attributes. In the presence of hundreds or thousands of features, as in DNA microarray analysis, researchers have noticed [23, 24] that it is common for a large number of features to be uninformative, because they are either irrelevant or redundant with respect to the class concept. Moreover, when the number of features is high but the number of samples is small, machine learning becomes particularly difficult, since the search space is sparsely populated and the model cannot correctly distinguish the relevant data from the noise [25]. There are two major approaches in feature selection: individual evaluation and subset evaluation. Individual evaluation, also known as feature ranking [26], assesses individual features by assigning them weights according to their degree of relevance. Subset evaluation, on the other hand, produces candidate feature subsets based on a certain search strategy. Besides this classification, feature selection methods can also be divided into three models: filters, wrappers and embedded methods [27]. With such a vast body of feature selection methods, the need arises for criteria that enable users to decide which algorithm to use (or not to use) in certain situations. This work reviews several feature selection methods from the literature and checks their performance in an artificial, controlled experimental scenario, contrasting the ability of the algorithms to select the relevant features and discard the irrelevant ones without permitting noise or redundancy to obstruct the process.
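Papers [4] and [5] both single out redundancy as a distinct problem from irrelevance. A simple, hedged sketch of one common redundancy-removal idea follows: scan features from most to least relevant, and keep a feature only if it is not strongly correlated with one already kept. The threshold and the synthetic data are illustrative assumptions, not values from the reviewed papers.

# Hypothetical redundancy filter: greedy scan with a correlation cutoff.
import numpy as np

def drop_redundant(X, relevance, threshold=0.9):
    # visit features in decreasing relevance; skip near-duplicates of kept ones
    order = np.argsort(relevance)[::-1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for f in order:
        if all(corr[f, k] < threshold for k in kept):
            kept.append(f)
    return kept

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X = np.hstack([X, X[:, :2] + 0.01 * rng.normal(size=(300, 2))])  # two near-copies
relevance = np.arange(X.shape[1], dtype=float)  # stand-in relevance weights
print("kept features:", drop_redundant(X, relevance))  # the near-copies drop out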
In paper [6], a supervised learning algorithm receives a set of labeled training examples, each with a feature vector and a class. The presence of irrelevant or redundant features in the feature set can often hurt the accuracy of the induced classifier [28]. Feature selection, the process of selecting a feature subset from the training examples and ignoring features outside this set during induction and classification, is an effective way to improve the performance and decrease the training time of a supervised learning algorithm. Feature selection typically improves classifier performance when the training set is small, without significantly degrading performance on large training sets [29]. Feature selection is sometimes essential to the success of a learning algorithm, since it can reduce the number of features to the extent that such an algorithm can be applied. Algorithms used for selecting features prior to concept induction fall into two categories. Wrapper methods wrap the feature selection around the induction algorithm to be used, using cross-validation to predict the benefits of adding or removing a feature from the feature subset used. Filter methods are general preprocessing algorithms that do not rely on any knowledge of the algorithm to be used. There are strong arguments in favor of both methods, and the paper presents a careful analysis of these arguments. It also introduces a new method of feature selection that is based on the concept of boosting from computational learning theory and combines the advantages of filter and wrapper methods. Like filters, it is very fast and general, while at the same time using knowledge of the learning algorithm to inform the search and provide a natural stopping criterion. The authors present empirical results using two different wrappers and three variants of their algorithm.
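The filter-then-wrapper pattern that papers [2], [4] and [6] describe can be sketched in a few lines: a cheap filter (mutual information here) shrinks the search space, then a wrapper fine-tunes on the survivors by cross-validating the target learner. This is a generic illustration under those assumptions, not an implementation of BBHFS or of any reviewed method; the cutoff of 10 candidates is arbitrary.

# Hedged sketch of a hybrid: filter stage followed by a wrapper stage.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Filter stage: keep the 10 features with the highest mutual information.
candidates = list(np.argsort(mutual_info_classif(X, y, random_state=0))[::-1][:10])

# Wrapper stage: greedy forward selection over the survivors,
# scored by cross-validated accuracy of the target learner.
selected, best = [], 0.0
improved = True
while improved and candidates:
    improved = False
    for f in candidates:
        acc = cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=5).mean()
        if acc > best:
            best, pick, improved = acc, f, True
    if improved:
        selected.append(pick)
        candidates.remove(pick)
print("hybrid-selected features:", selected, "cv accuracy: %.3f" % best)

The wrapper now searches over 10 features instead of 30, which is exactly the time-complexity saving the hybrid method section claims.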

In paper [7], feature subset selection (FSS) is one of the techniques used to preprocess the data before performing data mining tasks such as classification and clustering. FSS identifies a subset of the original features/variables [30] from a given data set while removing irrelevant and/or redundant features [31]. The objectives of FSS are to improve the prediction performance of the predictors, to provide faster and more cost-effective predictors, and to provide a better understanding of the underlying process that generated the data.

ANALYSIS:
Each reviewed paper is summarized below by problem, dataset, feature selection status, and algorithm.

1. Efficient feature selection for high-dimensional data using two-level filters
Problem: The feature subset contains features highly correlated with each other, with many irrelevant and redundant features; when genetic approaches are used, the time expense is high.
Dataset: UCI datasets (Ionosphere, Sonar, Spectf, Multi-feature).
Feature selection: ReliefF and KNNC are used to remove irrelevant and redundant data.
Algorithm: k-nearest neighbors clustering algorithm.

2. Hybrid feature selection by combining filters and wrappers
Problem: Feature subset selection is neither accurate nor fast.
Dataset: Disordered protein dataset and microarray datasets (AML and ALL, lung cancer).
Feature selection: A three-step procedure (preliminary screening, combination and fine tuning) makes the selected features accurate and fast to obtain.
Algorithm: Filters vs. wrappers; hybrid feature selection.

3. Feature selection with conjunctions of decision stumps and learning from microarray data
Problem: Learning from high-dimensional DNA microarray data.
Dataset: Microarray datasets (Colon, Leukemia, B_MD and C_MD, Lung, BreastER, celiac disease, colon epithelial biopsies, multiple myeloma and bone lesion).
Feature selection: Three learning algorithms process the high-dimensional microarray data.
Algorithm: An Occam's Razor learning algorithm; a PAC-Bayes learning algorithm.

4. A fast clustering-based feature subset selection algorithm for high-dimensional data
Problem: Irrelevant and redundant data; some irrelevant features can be removed, but redundant data remains.
Dataset: 35 benchmark datasets (chess, mfeat-fourier, coil2000, elephant, arrhythmia, fqs-nowe, colon, fbis.wc, AR10P, PIE10P, oh0.wc, oh10.wc, B-cell1, B-cell2, B-cell3, base-hock, TOX-171, tr12.wc, tr23.wc, tr11.wc, embryonal-tumours, leukemia1, leukemia2, tr21.wc, wap.wc, PIX10P, ORL10P, CLL-SUB-111, ohscal.wc, la2s.wc, la1s.wc, GCM, SMK-CAN-187, new3s.wc, GLA-BRA-180).
Feature selection: Using the FAST algorithm and its tree-based clustering technique, the irrelevant and redundant data can be removed.
Algorithm: FAST algorithm.

5. A review of feature selection methods on synthetic data
Problem: Irrelevant, redundant and noisy data.
Dataset: Artificial datasets (Corral, Corral-100, XOR-100, Parity3+3, Led-25, Led-100, Monk3, SD1, SD2, SD3, Madelon).
Feature selection: Filter, embedded and wrapper methods are used, and the data are ranked by a ranker method.
Algorithm: Filter methods; embedded methods; wrapper methods.

6. Filters, wrappers and a boosting-based hybrid for feature selection
Problem: Both the filter and wrapper methods have drawbacks in retrieving the significant data from a large dataset.
Dataset: Multi-class datasets (Vote, Chess, Mushroom, DNA, Lymphography, Ads).
Feature selection: Using BBHFS (a hybrid algorithm), both methods are merged to obtain the relevant data.
Algorithm: Hybrid algorithm; Naive Bayes (NB); ID3 with χ² pruning; k-nearest neighbors (kNN).

7. Feature subset selection and ranking for multivariate time series
Problem: Feature selection is not accompanied by a ranking of the features or by the time needed to retrieve the data.
Dataset: 1. HumanGait dataset; 2. Brain Computer Interface (BCI) dataset at the Max Planck Institute (MPI); 3. Brain and Behavior Correlates of Arm Rehabilitation (BCAR) kinematics dataset.
Feature selection: Using several methods and algorithms, the data are specified with a rank, and the retrieval time is also calculated.
Algorithm: PC and DCPC; CLeVer-Rank; CLeVer-Cluster; CLeVer-Hybrid.

CONCLUSION:
This literature review surveys the distinct types of existing feature selection techniques. Feature selection is one of the important processes in data mining; it can be used to reduce the irrelevant data and discover the significant data in a dataset. These data mining techniques can help to improve classifier accuracy, and they can also be applied in other domains such as the medical field. From the study of these distinct data mining techniques it can be concluded that a novel method is needed for handling insignificant features in high-dimensional datasets.

REFERENCES
1) Ferreira and M. Figueiredo, "Efficient feature selection filters for high-dimensional data," Pattern Recognit. Lett., vol. 33, no. 13.
2) H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Syst. Appl., vol. 38, no. 7.
3) M. Shah, M. Marchand, and J. Corbeil, "Feature selection with conjunctions of decision stumps and learning from microarray data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, Jan.
4) Qinbao Song, Jingjie Ni and Guangtao Wang, "A fast clustering-based feature subset selection algorithm for high-dimensional data," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1.
5) Verónica Bolón-Canedo, Noelia Sánchez-Maroño and Amparo Alonso-Betanzos, "A review of feature selection methods on synthetic data," Knowl Inf Syst (2013) 34.
6) S. Das, "Filters, wrappers and a boosting-based hybrid for feature selection," in Proc. 18th Int. Conf. Mach. Learn., 2001.
7) Hyunjin Yoon, Kiyoung Yang, and Cyrus Shahabi, "Feature subset selection and ranking for multivariate time series," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, Sep.

8) Liu, H., Dougherty, E. R., Dy, J. G., Torkkola, K., Tuv, E., Peng, H., et al. (2005). Evolving feature selection. IEEE Intelligent Systems, 20(6).
9) Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97.
10) Linding, R., Jensen, L. J., Diella, F., Bork, P., Gibson, T. J., & Russell, R. B. (2003). Protein disorder prediction: Implications for structural proteomics. Structure, 11(11).
11) Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3).
12) L. Wang, F. Chu, and W. Xie, "Accurate cancer classification using expressions of very few genes," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 1, Jan.-Mar.
13) T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16.
14) I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46.
15) Liu H., Motoda H. and Yu L., "Selective sampling approach to active feature selection," Artif. Intell., 159(1-2) (2004).
16) Molina L. C., Belanche L. and Nebot A., "Feature selection algorithms: A survey and experimental evaluation," in Proc. IEEE Int. Conf. Data Mining.
17) Guyon I. and Elisseeff A., "An introduction to variable and feature selection," Journal of Machine Learning Research, 3.
18) Ng A. Y., "On feature selection: learning with exponentially many irrelevant features as training examples," in Proceedings of the Fifteenth International Conference on Machine Learning.
19) Das S., "Filters, wrappers and a boosting-based hybrid for feature selection," in Proceedings of the Eighteenth International Conference on Machine Learning, pp. 74-81.
20) Xing E., Jordan M. and Karp R., "Feature selection for high-dimensional genomic microarray data," in Proceedings of the Eighteenth International Conference on Machine Learning.
21) Souza J., "Feature selection with a general hybrid algorithm," Ph.D. thesis, University of Ottawa, Ottawa, Ontario, Canada.
22) Yu J., Abidi S. S. R. and Artes P. H., "A hybrid feature selection strategy for image defining features: towards interpretation of optic nerve images," in Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, 8.
23) Yang Y, Pedersen JO (2003) A comparative study on feature selection in text categorization. In: Proceedings of the 20th international conference on machine learning.
24) Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5.
25) Provost F (2000) Distributed data mining: scaling up and beyond. In: Kargupta H, Chan P (eds) Advances in distributed data mining. Morgan Kaufmann, San Francisco.
26) Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3.
27) Guyon I, Gunn S, Nikravesh M, Zadeh L (2006) Feature extraction, foundations and applications. Springer, Heidelberg.
28) John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. Proceedings of ICML.
29) Hall, M. A. (1999). Correlation-based feature selection for machine learning. Doctoral dissertation, The University of Waikato, Department of Computer Science.
30) H. Liu, L. Yu, M. Dash, and H. Motoda, "Active feature selection using classes," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining.
31) Tucker, S. Swift, and X. Liu, "Variable grouping in multivariate time series via correlation," IEEE Trans. Systems, Man, and Cybernetics B, vol. 31, no. 2.
