The k-means Algorithm and Genetic Algorithm

Chapter 8 outline:
- k-means algorithm
- Genetic algorithm
- Rough set approach
- Fuzzy set approaches

The K-Means Algorithm

The K-Means algorithm is a simple yet effective statistical clustering technique. Here is the algorithm:

1. Choose a value for K, the total number of clusters to be determined.
2. Choose K instances (data points) within the dataset at random. These are the initial cluster centers.
3. Use simple Euclidean distance to assign the remaining instances to their closest cluster center.
4. Use the instances in each cluster to calculate a new mean for each cluster.
5. If the new mean values are identical to the mean values of the previous iteration, the process terminates. Otherwise, use the new means as cluster centers and repeat steps 3-5.
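The five steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `euclidean` and `k_means`, the `seed` and `max_iter` parameters, and the handling of an empty cluster (it keeps its previous center) are assumptions that do not come from the slides.

```python
import random


def euclidean(a, b):
    """Simple Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick K instances at random as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its old center.
                new_centers.append(centers[i])
        # Step 5: stop when the means no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    final_centers, final_clusters = k_means(data, k=2)
    print(final_centers)
```

The `max_iter` cap is only a safety net; on small numeric datasets the loop normally ends through the step 5 check.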

The K-Means Algorithm: An Example Using K-Means

The K-Means Algorithm: General Considerations

The k-Nearest Neighbor Algorithm

- All instances correspond to points in n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete-valued or real-valued.
- For a discrete-valued target function, k-NN returns the most common value among the k training examples nearest to the query point x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

For continuous-valued target functions, the k-NN algorithm calculates the mean value of the k nearest neighbors.

Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors:

$w = \frac{1}{d(x_q, x_i)^2}$

The same weighting applies to real-valued target functions.
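A short Python sketch of the three k-NN variants described above: majority vote for discrete targets, plain mean for continuous targets, and the distance-weighted mean using w = 1 / d(x_q, x_i)^2. The function names are illustrative. Normalizing the weighted sum by the total weight is the usual formulation but is an assumption here, since the slides only give the weight itself; returning the stored value when the distance is zero is likewise an assumption made to avoid division by zero.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (point, target) pairs closest to `query`."""
    return sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]


def knn_classify(train, query, k):
    """Discrete target: most common value among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Continuous target: mean value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(y for _, y in neighbors) / len(neighbors)


def knn_regress_weighted(train, query, k):
    """Continuous target: mean weighted by w = 1 / d(x_q, x_i)^2."""
    num = den = 0.0
    for x, y in k_nearest(train, query, k):
        d = math.dist(x, query)
        if d == 0:
            return y  # assumption: an exact match returns its own value
        w = 1.0 / d ** 2
        num += w * y
        den += w
    return num / den


if __name__ == "__main__":
    labeled = [((0, 0), "+"), ((1, 0), "+"), ((0, 1), "-"),
               ((5, 5), "-"), ((6, 5), "-")]
    print(knn_classify(labeled, (0.2, 0.1), k=3))

    numeric = [((0.0,), 0.0), ((2.0,), 2.0), ((10.0,), 10.0)]
    print(knn_regress(numeric, (0.5,), k=2))
    print(knn_regress_weighted(numeric, (0.5,), k=2))
```

In the numeric example the weighted estimate is pulled toward the closer neighbor, while the unweighted mean treats both neighbors equally.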

Genetic Learning

Here we present a basic genetic learning algorithm.

1. Initialize a population P of n elements, often referred to as chromosomes, as potential solutions.
2. Until a specified termination condition is satisfied:
   a. Use a fitness function to evaluate each element of the current solution. If an element passes the fitness criteria, it remains in P.
   b. The population now contains m elements (m <= n). Use genetic operators to create (n - m) new elements. Add the new elements to the population.
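The loop above might look as follows in Python. Several choices here are assumptions made only to get a runnable sketch, since the slides leave them open: chromosomes are bit strings, the fitness criterion is "rank in the top half of the population", new elements come from one-point crossover of two survivors with a 10% chance of a single bit flip, and the termination condition is reaching a `target` fitness or a generation limit. The name `genetic_learning` and all parameter names are illustrative.

```python
import random


def genetic_learning(fitness, length, n=20, target=None,
                     max_generations=100, seed=0):
    """Evolve bit-string chromosomes of the given length; return the fittest."""
    rng = random.Random(seed)
    # Step 1: initialize a population P of n random chromosomes.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(n)]
    for _ in range(max_generations):
        # Step 2a: evaluate every element with the fitness function.
        ranked = sorted(population, key=fitness, reverse=True)
        # Termination condition (assumed): target fitness reached.
        if target is not None and fitness(ranked[0]) >= target:
            break
        # Assumed fitness criterion: the top half remains in P (m = n // 2).
        survivors = ranked[: n // 2]
        # Step 2b: create the n - m new elements with genetic operators.
        children = []
        while len(survivors) + len(children) < n:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randint(1, length - 1)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                 # occasional mutation
                i = rng.randrange(length)
                flipped = "1" if child[i] == "0" else "0"
                child = child[:i] + flipped + child[i + 1:]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy problem: maximize the number of 1 bits in a 12-bit chromosome.
    best = genetic_learning(lambda c: c.count("1"), length=12, target=12)
    print(best, best.count("1"))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next, although nothing guarantees that the final result is a global optimum.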

Genetic Learning: Genetic Algorithms and Supervised Learning

Genetic Learning: Genetic Algorithms and Unsupervised Clustering

Genetic Learning: General Considerations

Here is a list of considerations when using a problem-solving approach based on genetic learning:

- Genetic algorithms are designed to find globally optimized solutions. However, there is no guarantee that any given solution is not the result of a local rather than a global optimization.
- The fitness function determines the computational complexity of a genetic algorithm. A fitness function involving several calculations can be computationally expensive.
- Genetic algorithms explain their results to the extent that the fitness function is understandable.
- Transforming the data to a form suitable for a genetic algorithm can be a challenge.

Genetic algorithm (GA): based on an analogy to biological evolution
- Each rule is represented by a string of bits.
- An initial population is created consisting of randomly generated rules.
- Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.

- Population-based technique for the discovery of knowledge structures
- Based on the idea that evolution represents a search for an optimum solution set
- Massively parallel

Population: a set of individuals, each represented by one or more strings of characters
Chromosome: the string representing an individual

Gene: the basic informational unit on a chromosome
Allele: the value of a specific gene
Locus: the ordinal place on a chromosome where a specific gene is found

Example: Chromosome 011010, Gene (Allele = "0"), Locus = 5

Reproduction: increase the representation of strong individuals
Crossover: explore the search space
Mutation: recapture genes lost due to crossover

Parents in all three cases: Parent 1 = 011010, Parent 2 = 000110

Operation                                                      Offspring 1   Offspring 2
Simple reproduction                                            011010        000110
Reproduction with crossover at locus 3                         011110        000010
Simple reproduction with mutation at locus 3 for offspring 1   010010        000110
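The crossover and mutation results for these parents can be reproduced with two small helper functions. The names `crossover` and `mutate` are illustrative, and the sketch assumes that loci are counted from 1 starting at the left end of the chromosome and that "crossover at locus 3" means the parents swap everything after the third gene, which is the reading that matches the offspring shown.

```python
def crossover(parent1, parent2, locus):
    """One-point crossover: swap the tails that follow the given locus."""
    child1 = parent1[:locus] + parent2[locus:]
    child2 = parent2[:locus] + parent1[locus:]
    return child1, child2


def mutate(chromosome, locus):
    """Flip the bit at the given 1-based locus."""
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


if __name__ == "__main__":
    p1, p2 = "011010", "000110"
    print(crossover(p1, p2, 3))   # ('011110', '000010')
    print(mutate(p1, 3))          # 010010
```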

Fitness: the ability of an individual to survive into the next generation
- Survival of the fittest
- Usually calculated in terms of an objective fitness function: maximization, minimization, or other functions

Genetic programming
- Based on adaptation and evolution
- Structures undergoing adaptation are computer programs of varying size and shape
- Computer programs are genetically bred over time

- Rule-based knowledge discovery and concept learning tool
- Operates by means of evaluation, credit assignment, and discovery applied to a population of chromosomes (rules), each with a corresponding phenotype (outcome)

Performance
- Provides interaction between environment and rule base
- Performs matching function
Reinforcement
- Rewards accurate classifiers
- Punishes inaccurate classifiers
Discovery
- Uses the genetic algorithm to search for plausible rules

Rough sets are used to approximately or roughly define equivalence classes. A rough set for a given class C is approximated by two sets:
- a lower approximation (certain to be in C)
- an upper approximation (cannot be described as not belonging to C)
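A small Python sketch of the two approximations. It assumes the standard rough-set construction in which objects with identical values on the chosen attributes are indistinguishable and form one equivalence class: a class lies in the lower approximation when all of its members belong to C, and in the upper approximation when at least one member does. The function name `approximations` and the sample objects are hypothetical.

```python
from collections import defaultdict


def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of the set `target`.

    objects:    dict mapping object name -> dict of attribute values
    attributes: attribute names used to tell objects apart
    target:     set of object names that belong to class C
    """
    # Group objects that share the same values on the chosen attributes.
    blocks = defaultdict(set)
    for name, values in objects.items():
        blocks[tuple(values[a] for a in attributes)].add(name)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # every member is in C: certainly in C
            lower |= block
        if block & target:       # some member is in C: cannot be ruled out
            upper |= block
    return lower, upper


if __name__ == "__main__":
    samples = {
        "a": {"color": "red", "size": "big"},
        "b": {"color": "red", "size": "big"},
        "c": {"color": "blue", "size": "big"},
        "d": {"color": "blue", "size": "small"},
    }
    low, up = approximations(samples, ["color", "size"], target={"a", "c"})
    print(sorted(low), sorted(up))
```

Here "a" and "b" are indistinguishable but only "a" is in C, so both land in the upper approximation and neither in the lower one.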

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, by using a fuzzy membership graph).
Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high}, with a fuzzy value calculated for each.

- For a given new sample, more than one fuzzy value may apply.
- Each applicable rule contributes a vote for membership in the categories.
- Typically, the truth values for each predicted category are summed.
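The fuzzy classification steps can be illustrated with a toy Python sketch. Everything specific in it is hypothetical: the income breakpoints (20,000, 40,000, 60,000), the piecewise-linear membership shapes, the three rules, and the "approve"/"reject" categories are made up for illustration and do not come from the slides. The sketch only shows the mechanism: an income gets a degree of membership in each of {low, medium, high}, every rule whose condition has a non-zero truth value votes for its category, and the truth values are summed per category.

```python
def falling(x, a, b):
    """Membership 1.0 up to a, decreasing linearly to 0.0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def rising(x, a, b):
    """Membership 0.0 up to a, increasing linearly to 1.0 at b."""
    return 1.0 - falling(x, a, b)


def triangle(x, a, peak, b):
    """Membership rising from a to peak, then falling back to 0.0 at b."""
    return min(rising(x, a, peak), falling(x, peak, b))


def fuzzify_income(income):
    """Map an income to fuzzy values (hypothetical breakpoints)."""
    return {
        "low": falling(income, 20_000, 40_000),
        "medium": triangle(income, 20_000, 40_000, 60_000),
        "high": rising(income, 40_000, 60_000),
    }


# Hypothetical rules: (fuzzy income value, predicted category).
RULES = [("low", "reject"), ("medium", "approve"), ("high", "approve")]


def classify(income):
    """Sum the truth values of all applicable rules per category."""
    fuzzy = fuzzify_income(income)
    votes = {}
    for condition, category in RULES:
        votes[category] = votes.get(category, 0.0) + fuzzy[condition]
    return votes


if __name__ == "__main__":
    print(fuzzify_income(25_000))
    print(classify(25_000))
    print(classify(50_000))
```

An income of 50,000 is partly "medium" and partly "high", so two rules apply at once and both contribute to the same category.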

Chapter Summary

The K-Means algorithm is a statistical unsupervised clustering technique. All input attributes to the algorithm must be numeric, and the user is required to decide how many clusters are to be discovered. The algorithm begins by randomly choosing one data point to represent each cluster. Each data instance is then placed in the cluster to which it is most similar. New cluster centers are computed, and the process continues until the cluster centers do not change.

The K-Means algorithm is easy to implement and understand. However, the algorithm is not guaranteed to converge to a globally optimal solution, lacks the ability to explain what has been found, and is unable to tell which attributes are significant in determining the formed clusters. Despite these limitations, the K-Means algorithm is among the most widely used clustering techniques.

Genetic algorithms apply the theory of evolution to inductive learning. Genetic learning can be supervised or unsupervised and is typically used for problems that cannot be solved with traditional techniques. A standard genetic approach to learning applies a fitness function to a set of data elements to determine which elements survive from one generation to the next. Those elements not surviving are used to create new instances to replace deleted elements. In addition to being used for supervised learning and unsupervised clustering, genetic techniques can be employed in conjunction with other learning techniques.

Key Terms

- Affinity analysis. The process of determining which things are typically grouped together.
- Confidence. Given a rule of the form "If A then B", confidence is defined as the conditional probability that B is true when A is known to be true.
- Crossover. A genetic learning operation that creates new population elements by combining parts of two or more elements from the current population.
- Genetic algorithm. A data mining technique based on the theory of evolution.
- Mutation. A genetic learning operation that creates a new population element by randomly modifying a portion of an existing element.
- Selection. A genetic learning operation that adds copies of current population elements with high fitness scores to the next generation of the population.

Data Mining: Concepts and Techniques (slides for Chapter 7 of the textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada