Domain Independent Prediction with Evolutionary Nearest Neighbors.

Research Summary

Introduction

In January of 1848, a few tiny gold nuggets were discovered on the American River at Coloma, near Sacramento. This triggered one of the largest human migrations in history, as half a million people from around the world descended upon California in search of instant wealth [2]. We live in a data-rich, information-poor environment [1], which calls for a similar migration of computational tools to suit the data and extract valuable information nuggets. Data mining is a multidisciplinary field whose primary objective is to help knowledge workers extract information from large volumes of data. In most practical data mining applications, tools are heavily customized for the specific application domain. There is a need for a generalized data mining tool for at least one major area of data mining. This work is an attempt to investigate, build, and evaluate a generalized prediction framework that can be migrated easily to a new application in order to do meaningful data mining there. We propose a nearest neighbor prediction approach with genetic algorithm (GA) based relevance tuning for the particular application domain. Generalizing the tool for easy migration by means of a GA may be computationally prohibitive for large data sets. We therefore propose the use of a vertical, data-mining-ready data structure (P-trees¹) that would enable the tool framework to remain computationally efficient in the generalized setting.

¹ P-tree technology is patent pending.

Background

The proposed work falls into the area of research categorized as data mining and knowledge discovery with evolutionary algorithms. The main motivation for applying evolutionary algorithms to data mining tasks is that they are robust and adaptive. Classification (prediction) is probably the most widely studied data mining task [3]. K Nearest Neighbor (KNN) classification is well explored in the literature and has been shown to have good classification (prediction) performance on a wide range of real-world data sets [4]. KNN is simple and straightforward to implement. The use of a distance metric in KNN opens a wide array of opportunities to use evolutionary techniques to tune the metric to a particular application domain. Most existing research uses evolutionary techniques for dimensionality reduction and attribute relevance [4],[5],[6],[7]. In some other cases, evolutionary techniques are used to optimize other parameters, such as the optimum k in KNN [4]. Genetic algorithms [8] are parallel, iterative optimizers and have been successfully applied to a broad spectrum of optimization problems [4]. Attribute dimensions can be scaled using a genetic algorithm to optimize the classification accuracy of a separate algorithm, such as KNN [7].

The artificial tuning process requires the prediction model to be evaluated iteratively, which could be computationally expensive for large data sets. P-trees are a lossless, compressed, and data-mining-ready data structure. This data structure has been successfully applied in data mining applications for real-world data [6],[9],[10],[11]. Efficient computation of the required neighborhood counts leads to a low-cost solution for the iterative evaluation required for GA-based evolution (tuning). Two major obstacles to quick and easy migration of a tool framework are diversity in data and diversity in domain knowledge. These could be addressed with the use of P-trees and a GA, respectively.

Proposed Approach

The main objective of the proposed work is a generalized prediction framework. This should allow easy migration of the tool framework to different application domains.
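
The GA-based scaling of attribute dimensions described in the Background can be illustrated with a small sketch. It is not the implementation proposed in this work: it assumes a plain in-memory KNN rather than P-tree neighborhood counts, leave-one-out accuracy as the fitness function, and a very simple GA (elitist selection, one-point crossover, Gaussian mutation). All function names, the toy data set, and the parameter values are hypothetical.

```python
import random
from collections import Counter


def weighted_distance(a, b, weights):
    """Squared Euclidean distance with one scale factor per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def knn_predict(train, query, weights, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda s: weighted_distance(s[0], query, weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, weights, k=3):
    """Leave-one-out accuracy, used here as the GA fitness (an assumption)."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, weights, k) == y
    return hits / len(data)


def evolve_weights(data, n_attrs, pop_size=20, generations=15, seed=0):
    """Evolve attribute weights in [0, 1] that maximize leave-one-out accuracy."""
    rng = random.Random(seed)
    # Start from the unweighted baseline plus random weight vectors.
    pop = [[1.0] * n_attrs]
    pop += [[rng.random() for _ in range(n_attrs)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda w: loo_accuracy(data, w), reverse=True)
        elite = scored[: pop_size // 2]  # the best individuals always survive
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_attrs) if n_attrs > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            j = rng.randrange(n_attrs)   # mutate one weight, clamped to [0, 1]
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: loo_accuracy(data, w))


def make_toy_data(n=40, seed=1):
    """Hypothetical data: attribute 0 is informative, attribute 1 is noise."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        informative = label + rng.gauss(0, 0.3)
        noise = rng.uniform(-5, 5)
        data.append(((informative, noise), label))
    return data


if __name__ == "__main__":
    toy = make_toy_data()
    best = evolve_weights(toy, n_attrs=2)
    print("unweighted accuracy:", loo_accuracy(toy, [1.0, 1.0]))
    print("tuned weights:", best, "accuracy:", loo_accuracy(toy, best))
```

In the toy data the second attribute is pure noise, so a useful weight vector scales it down relative to the first. Because the all-ones vector is placed in the initial population and the best individual always survives, the tuned weights are never worse than unweighted KNN on this fitness measure.
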
We propose the use of an artificially tuned nearest neighbor type prediction model. We propose exploring all possible tuning parameters for the nearest neighbor prediction model with the use of a GA. For example, the neighborhood search need not be restricted to k neighbors if a GA-optimized influence function is used in the similarity metric (Figure 2). The use of the P-tree data structure allows this work to go beyond classical nearest neighbor classification. In the classical approach, similarity counting is done through expensive database scans; here it is replaced by a collection of logical operations on compressed bit vectors in P-trees.

An outline of the proposed architecture is shown in Figure 1. Training data from the application domain will first be converted to P-trees. These will be used for neighborhood counting in the predictor. The GA will be used to tune the predictor. Finally, the input samples will be predicted with the tuned predictor.

Figure 1. Proposed outline of architecture (diagram labels: Application Training Data, Genetic Algorithm, P-tree Engine & Data Repository, Nearest Neighbor Predictor, Data to be predicted, Tuned Predictor, Prediction).

Figure 2. Example of two parameters (diagram labels: Attr. Relevance, Neighborhood influence) that could be tuned on the similarity metric of the nearest neighbor predictor with the use of the genetic algorithm.

Proposed Evaluation

With respect to the main objective of this work, the proposed tool framework should be evaluated in at least two diverse application domains to show that migration is easy regardless of the application domain. It will also be an advantage to choose an application domain with a high potential for return from the use of data mining. Prediction applications in bioinformatics and software project cost estimation are proposed as two initial application areas. In bioinformatics there is an abundance of data [12], with some specific applications, such as protein function prediction, and other less specific classification and prediction applications. In software engineering there is a specific need for good software cost predictions [13] for the very survival of the industry, with a general intuition that data mining can provide a reasonable solution.

The two major criteria for evaluating the quality of a solution are accuracy and computational cost. In this work, more emphasis will be placed on accuracy. The proposed evaluation will test the tool framework against published results of existing solutions. Each selected application domain will be tested with only the migration enabled by the artificial tuning proposed in this work.

Conclusion

As with the human migration in the Gold Rush, we are proposing a tool framework with quick and easy migration across application domains to find valuable information. The two major obstacles to the migration of data mining tools are diversity in application data and diversity in domain knowledge. Data diversity is handled by the use of a uniform and computationally efficient data structure, P-trees. Diversity in domain knowledge is handled by the use of an evolutionary algorithm, a GA. Successful completion of the proposed work will contribute to the body of knowledge by demonstrating the feasibility of an enabling technology for a computationally intelligent gold rush for information in diverse application domains.
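
The replacement of database scans by logical operations on bit vectors, described in the proposed approach, can be sketched as follows. This is only an illustration under stated assumptions: each attribute is stored as uncompressed vertical bit slices held in Python integers, whereas P-trees additionally compress these vectors, which is omitted here. The helper names and the sample values are hypothetical.

```python
def build_bit_slices(values, n_bits):
    """Return n_bits integers; bit r of slice b is bit b of values[r]."""
    slices = [0] * n_bits
    for row, v in enumerate(values):
        for b in range(n_bits):
            if (v >> b) & 1:
                slices[b] |= 1 << row
    return slices


def rows_matching(slices, target, n_rows, high_bits=None):
    """Bitmask of rows whose value agrees with target on the high-order bits.

    With high_bits=None all bits are compared (exact match). Only AND and
    NOT operations on whole bit vectors are used; no row is scanned.
    """
    n_bits = len(slices)
    if high_bits is None:
        high_bits = n_bits
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for b in range(n_bits - 1, n_bits - 1 - high_bits, -1):
        s = slices[b]
        mask &= s if (target >> b) & 1 else (all_rows & ~s)
    return mask


def count_rows(mask):
    """Number of rows selected by a bitmask."""
    return bin(mask).count("1")


if __name__ == "__main__":
    values = [5, 7, 4, 12, 5, 6]   # one 4-bit attribute, six rows
    labels = [1, 0, 1, 1, 0, 1]    # hypothetical class labels
    slices = build_bit_slices(values, n_bits=4)
    class_mask = sum(1 << r for r, lab in enumerate(labels) if lab)

    exact = rows_matching(slices, 5, len(values))
    near = rows_matching(slices, 5, len(values), high_bits=2)  # values 4..7
    print("rows equal to 5:", count_rows(exact))
    print("rows in the interval around 5:", count_rows(near))
    print("of those, rows in class 1:", count_rows(near & class_mask))
```

Matching only the high-order bits selects an interval around the target value, which gives one simple way to count neighbors, and their class membership, with a handful of bitwise operations instead of a pass over the rows.
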

References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Academic Press, Morgan Kaufmann Publishers, 2001.
[2] http://ceres.ca.gov/ceres/calweb/geology/goldrush.html
[3] A.A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, Advances in Evolutionary Computation, pp. 819-845, Springer-Verlag, August 2002.
[4] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, and L.C. Jain, Dimensionality reduction using genetic algorithms, IEEE Trans. on Evolutionary Computation, 4(2):164-171, 2000.
[5] J. Yang and V. Honavar, Feature subset selection using a genetic algorithm, in: H. Liu and H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 117-136, Kluwer, 1998.
[6] Amal Perera, Anne Denton, Pratap Kotala, William Jockheck, Willy Valdivia Granda, and William Perrizo, P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, Vol. 4, Issue 2, January 2003.
[7] W.F. Punch, E.D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, Further research on feature selection and classification using genetic algorithms, Proc. of the Fifth Int. Conf. on Genetic Algorithms, pp. 557-564, San Mateo, CA, 1993.
[8] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, 1989.
[9] Q. Ding, Q. Ding, and W. Perrizo, ARM on RSI Using P-trees, Pacific-Asia KDD Conf., pp. 66-79, Taipei, May 2002.
[10] Q. Ding, Q. Ding, and W. Perrizo, Decision Tree Classification of Spatial Data Streams Using Peano Count Trees, ACM SAC, pp. 426-431, Madrid, Spain, March 2002.
[11] M. Khan, Q. Ding, and W. Perrizo, KNN on Data Stream Using P-trees, Pacific-Asia KDD, pp. 517-528, Taipei, May 2002.
[12] S. Beck and P. Sterk, Genome-scale DNA sequencing: where are we?, Curr. Opin. Biotechnol. 9, 116-120, 1998.
[13] S. Chulani, B. Boehm, and B. Steece, Bayesian analysis of empirical software engineering cost models, IEEE Transactions on Software Engineering, 25(4), July/August 1999.