Genetic Algorithm (GA)


Genetic Algorithm (GA): A QSAR model development tool
NANOBRIDGES - A Collaborative Project

The authors are grateful for the financial support from the European Commission through the Marie Curie IRSES program, NanoBRIDGES project (FP7-PEOPLE-2011-IRSES, Grant Agreement number 295128).

Genetic Algorithm (GA) Tool for QSAR model development

A genetic algorithm (GA) is a heuristic search method that mimics the process of natural selection. Where an exhaustive search is impractical, heuristic methods are used to speed up the process of finding a satisfactory solution. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection.

The evolution usually starts from a population of randomly generated individuals and is an iterative process; the population in each iteration is known as a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The fitter individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population [1] [2].

Theoretical background and the Algorithm

Initialization of genetic algorithm
Initially, many individual solutions are randomly generated to form an initial population that covers the entire range of possible solutions. Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.

Selection
During each successive generation, a proportion of the existing population is selected to breed a new generation. Each individual solution is evaluated with a fitness function, and the best individuals are selected on the basis of their fitness.
The fitness function used in this program is as follows:

$$F = \sum_{i=0}^{k} \frac{P_i - \hat{P}_i}{\hat{P}_{i,\mathrm{Max}} - \hat{P}_{i,\mathrm{Min}}} \qquad (1)$$

where,
$F$ = Fitness score
$k$ = Number of parameters
$P_i$ = Property of an individual
$\hat{P}_i$ = Desired value of that property
$\hat{P}_{i,\mathrm{Max}}$ = Maximum desired value of that property
$\hat{P}_{i,\mathrm{Min}}$ = Minimum desired value of that property
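As a rough illustration, equation 1 can be read as a sum of range-scaled differences between each validation parameter and its desired value. The sketch below assumes that reading (a subtraction in both the numerator and the denominator). The function name fitness_score and the maximum/minimum values in the demo are hypothetical and are not taken from the tool.

```python
def fitness_score(values, desired, desired_max, desired_min):
    """Sum of (P_i - desired_i) / (max_i - min_i) over all validation parameters."""
    if not (len(values) == len(desired) == len(desired_max) == len(desired_min)):
        raise ValueError("all inputs need one entry per validation parameter")
    score = 0.0
    for p, target, hi, lo in zip(values, desired, desired_max, desired_min):
        span = hi - lo
        if span == 0:
            raise ValueError("desired maximum and minimum must differ")
        score += (p - target) / span
    return score


if __name__ == "__main__":
    # Hypothetical example: two parameters with desired values 0.6 and 0.5,
    # each assumed to range from 0.0 to 1.0.
    print(fitness_score([0.8, 0.7], [0.6, 0.5], [1.0, 1.0], [0.0, 0.0]))
```

How a parameter with a "less than" target is scored is not stated in the text, so this sketch treats every parameter the same way.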

At present, 5 validation parameters (hence k = 5) are used, i.e. r^2, r^2 adjusted, q^2, average r_m^2, and delta r_m^2, with their desired values >0.6, >0.6, >0.6, >0.5, and <0.2 respectively.

Another fitness function, Lack of Fit (LOF), which is computed and displayed in the output file, is as follows:

$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left(1 - \dfrac{c + d\,p}{M}\right)^{2}} \qquad (2)$$

where,
LSE = least squares error of the model
c = number of descriptors in the model
d = smoothing parameter (default value = 1.0)
p = total number of descriptors
M = total number of compounds

Genetic operators
The next step is to generate a second-generation population of solutions from those selected, through a combination of genetic operators: crossover (also called recombination) and mutation. For each new solution to be produced, a pair of "parent" solutions is selected for breeding from the pool selected previously. Producing a "child" solution by crossover and mutation creates a new solution that typically shares many of the characteristics of its "parents". Generally, the average fitness of the population will have increased by this procedure, since only the best solutions from the first generation are selected for breeding, along with a small proportion of less fit solutions. These less fit solutions ensure genetic diversity within the genetic pool of the parents and therefore ensure the genetic diversity of the subsequent generation of children.

Although crossover and mutation are known as the main genetic operators, it is possible to use other operators such as regrouping, colonization-extinction, or migration in genetic algorithms. It is worth tuning parameters such as the mutation probability, crossover probability and population size to find reasonable settings for the problem class being worked on.

Termination
This process is repeated until a termination condition has been reached. Common terminating conditions are:
1. A solution is found that satisfies minimum criteria.
2. A fixed number of generations (usually user-defined) is reached [1] [2].

Genetic algorithm (GA) Tool
The GA tool (see snapshot 1) runs the genetic algorithm to select significant variables (descriptors) during QSAR model development. The fitness function (fitness score; equation 1) used to select the best solutions (i.e. descriptor combinations) is explained above. The fitness score is directly proportional to the quality of the QSAR model, so one can select the best model by choosing the QSAR model with the highest fitness score. The Lack of Fit (LOF) fitness function (equation 2) is also computed and displayed in the output file. Although the LOF score is not used in the algorithm for selecting the best QSAR models in each generation, it is still useful for judging the quality of the model; note that LOF is inversely proportional to the quality of the model. Some validation parameters [3] such as R^2, R^2 adjusted, SEE, Q^2 (LOO) and SDEP are also calculated and displayed in the output file. One can judge the robustness of the QSAR model by analyzing these validation parameters.

Genetic algorithm (GA) Tool Folder
The program folder consists of three folders: "Data", "Lib" and "Output". For convenience, the user may keep the input file in the "Data" folder and save output files in the "Output" folder, since by default clicking on the browse button opens these folders. The "Lib" folder contains the library files required for running the program, so do not move, delete or rename these library files.

Snapshot 1: Genetic Algorithm Tool

The Lib folder also contains a descriptor database file (DescriptorDatabase.xlsx; snapshot 1A) with basic information about descriptors calculated using the Cerius2, Dragon and PaDEL software. This information is used to display a brief description of each descriptor selected by the GeneticAlgorithm run in the output file. The database file can be updated by the user, either by inserting information about a descriptor into any of the current sheets or by adding another sheet for descriptors calculated using different software, if required. Since the program identifies a descriptor from its Descriptor Symbol/Name (first column), it is important that the name is typed or copy-pasted accurately, without any extra space; the name is also case sensitive. Further, take care that no cell is left blank. If any information is not available, type "Information Not Available".

Snapshot 1A

Input file format (i.e. training set file format)
Three file types are allowed as the input file: xlsx, xls and csv. The input file (see snapshot 2) should consist of compound numbers (first column), descriptor values and the endpoint values (last column) for each object/compound. The format in which this information should be placed in the file is as follows:

First row: Header, i.e. a name for each column, for instance descriptor names and the endpoint name. It can be numerical, alphabetical or alphanumerical.
First column: Serial number/compound number (only numerical values).
Subsequent columns: Property/independent variable/descriptor values; each column holds the values of one descriptor for all the nanoparticles. These values should be numerical, not alphabetical or alphanumerical.
Last column: Endpoint values/dependent variable (only numerical values).
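The layout rules above can be checked before a file is handed to the tool. The following is a minimal sketch for the csv case only, written in Python with the standard csv module; the function name parse_training_set and its return shape are hypothetical and are not part of the GA tool.

```python
import csv


def parse_training_set(lines):
    """Split csv text into descriptor names, compound IDs, descriptor rows and endpoint values."""
    rows = [row for row in csv.reader(lines) if row]  # skip completely empty lines
    if len(rows) < 2:
        raise ValueError("need a header row and at least one compound")
    header, body = rows[0], rows[1:]
    if len(header) < 3:
        raise ValueError("need an ID column, at least one descriptor and an endpoint")
    ids, descriptors, endpoint = [], [], []
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(f"row {number}: expected {len(header)} cells, found {len(row)}")
        try:
            # Every cell below the header must be numeric; blank cells fail here too.
            numbers = [float(cell) for cell in row]
        except ValueError:
            raise ValueError(f"row {number}: every cell must be numeric") from None
        ids.append(row[0])
        descriptors.append(numbers[1:-1])
        endpoint.append(numbers[-1])
    return header[1:-1], ids, descriptors, endpoint


if __name__ == "__main__":
    sample = ["No,D1,D2,Activity", "1,0.5,1.2,3.4", "2,0.7,1.1,2.9"]
    names, ids, x, y = parse_training_set(sample)
    print(names, ids, x, y)
```

The function accepts any iterable of text lines, so a file opened with open(path, newline="") can be passed in directly.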

Snapshot 2

How to run the program
It is simple! Just click/double-click on the jar file (GeneticAlgorithm.jar) present in the GA folder. A window will open as shown in Snapshot 1, with a few inputs (given below) that the user should provide before clicking on the Submit button to run the program.

Select Training set File: Click on the browse button to select the training set file. By default, it will open the Dataset folder present in the GA program folder, so for convenience the user can keep the input file in the Dataset folder.

Select Output Directory: Click on the browse button to select the destination/output file directory and define the output file name. By default, it will open the Output folder present in the GA program folder, so for convenience the user can save the output files in the Output folder.

Optionally, the user can pretreat the dataset to remove constant and inter-correlated descriptors prior to Genetic Algorithm execution. If the user selects the checkbox labeled "Data Pre-treatment", then the following information has to be provided:

*Enter Variance cut-off: Enter the variance cut-off value based on which the constant variables will be removed. By default, the cut-off value is set to 0.0001.
*Enter correlation coefficient cut-off: Enter the inter-correlation coefficient cut-off value based on which the inter-correlated variables will be removed. By default, the cut-off value is set to 0.99.

Total number of Iterations: The number of iterations (generations) the algorithm will run. This is one of the stopping criteria (default value: 100). The other stopping criterion used in this program is that the algorithm stops if there is no change in the fitness function value (a difference of less than 0.001) for 10 successive generations.

Equation Length: This is the number of descriptors that will be present in the generated models and also in the final best QSAR models (default value: 3). At present, the algorithm used in this tool does not include addition and deletion as operators, so the equation length remains the same throughout the algorithm.

Cross-Over Probability: The probability (range 0 to 1) of performing the cross-over operation on chromosomes (in this case, the generated/selected equations) in each generation. The default value is 1. In this version, the user cannot change the default value.

Mutation Probability: The probability (range 0 to 1) of performing the mutation operation on chromosomes in each generation. The default value is 0.5, reflecting its low probability of occurring in nature.

Initial number of equations generated: The number of equations generated in the first step (first generation). In this version, the user cannot change the default value, i.e. 100.

Number of best equations selected: The number of best equations selected in each generation (default value: 20).

Smoothing parameter (for LOF calculation): One of the terms used in the calculation of the LOF fitness function (equation 2). The default value is 1.

Optionally, the user can also perform process validation by selecting the checkbox labeled "Process Validation" and then entering the number of random models to be generated (the default value is 10).
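To make the interplay of these settings concrete, the sketch below shows a generic GA loop over descriptor subsets that uses the documented defaults and both stopping criteria, together with a direct evaluation of equation 2. It is not the tool's implementation: the crossover and mutation operators, the carry-over of the selected equations into the next generation, and the names run_ga and lack_of_fit are all assumptions made for illustration. The fitness argument is any caller-supplied function that scores a tuple of descriptor column indices.

```python
import random


def lack_of_fit(lse, c, p, m, d=1.0):
    """Equation 2: LOF = LSE / (1 - (c + d*p) / M)**2."""
    shrink = 1.0 - (c + d * p) / m
    if shrink == 0:
        raise ValueError("c + d*p must differ from M")
    return lse / shrink ** 2


def run_ga(n_descriptors, fitness, equation_length=3, iterations=100,
           initial_equations=100, best_selected=20,
           mutation_probability=0.5, tolerance=0.001, patience=10, seed=None):
    """Return (best equation, its fitness, generations run) for a maximised fitness."""
    if not 0 < equation_length < n_descriptors:
        raise ValueError("equation_length must be between 1 and n_descriptors - 1")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    columns = range(n_descriptors)

    def crossover(a, b):
        # Assumed operator: the child draws its descriptors from both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, equation_length)))

    def mutate(equation):
        # Assumed operator: swap one descriptor for a column not yet in the equation.
        unused = [i for i in columns if i not in equation]
        changed = list(equation)
        changed[rng.randrange(equation_length)] = rng.choice(unused)
        return tuple(sorted(changed))

    population = {tuple(sorted(rng.sample(columns, equation_length)))
                  for _ in range(initial_equations)}
    previous_top, stalled = None, 0
    for generation in range(1, iterations + 1):
        # Keep the best equations; the inner sort makes tie order reproducible.
        ranked = sorted(sorted(population), key=fitness, reverse=True)[:best_selected]
        top_score = fitness(ranked[0])
        if previous_top is not None and abs(top_score - previous_top) < tolerance:
            stalled += 1
        else:
            stalled = 0
        previous_top = top_score
        if stalled >= patience:
            break  # no change in the best fitness for `patience` generations
        # Breed: crossover is always applied (probability 1), mutation only sometimes.
        population = set(ranked)
        for _ in range(10 * initial_equations):  # bounded in case few subsets exist
            if len(population) >= initial_equations:
                break
            child = crossover(rng.choice(ranked), rng.choice(ranked))
            if rng.random() < mutation_probability:
                child = mutate(child)
            population.add(child)
    return ranked[0], top_score, generation


if __name__ == "__main__":
    # Toy fitness: count how many of three "informative" columns an equation contains.
    toy = lambda equation: sum(1 for i in equation if i in (2, 5, 7))
    print(run_ga(10, toy, seed=1))
    print(lack_of_fit(lse=2.0, c=3, p=10, m=50))
```

Because the equation length is fixed, every candidate is a sorted tuple of distinct column indices, which keeps duplicate equations out of the population without extra bookkeeping.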

Output

Snapshot 3: Text file
Snapshot 4: _filecode.xlsx file
Snapshot 5: _XRandom.xlsx file

1. Output text file (_GA.txt) (snapshot 3): After the successful execution of the GA, the generated text file will contain the resultant MLR equation, fitness function information (i.e. fitness score and LOF value), and validation parameters such as r^2, r^2 adjusted, SEE, q^2, and SDEP. It will also contain a brief description of each selected descriptor (if the selected descriptor's information is present in the database).
2. _Filecode.xlsx (snapshot 4): It contains the compound no./serial no. (first column), the selected descriptors (subsequent columns) and the endpoint (last column). For each GA run, a new xlsx/xls/csv file with a different file name (containing a file code) is generated. This file holds the compound numbers, descriptors and property/activity values of the respective selected best equation. The file code is explained in the note below.
3. _Filecode_XRandomResults.xlsx (snapshot 5): The process validation results consist of the r, r^2 and q^2 values for the original and all the generated random models; the average r, r^2 and q^2 of the random models; and the crp^2 value.

Note: GA is usually executed multiple times (e.g. 10-15 times) until an optimized solution is found. For user convenience, if the output file name is not changed and the program is executed a number of times by clicking the Submit button, each new GA result is appended to the same text file (snapshot 3). The user can then select the best GA result based on the fitness functions and/or validation parameters. To find the Excel file (.xlsx/xls/csv) holding the selected descriptors and endpoint information for that particular GA run, note the GA file code (snapshot 3; encircled) printed in brackets with every GA result. The corresponding Excel file name bears the same code (snapshot 4; encircled) as the GA result.

References:
1. http://en.wikipedia.org/wiki/genetic_algorithm
2. https://engineering.purdue.edu/~lips/publications/proddesgn/encyc.pdf
3. Roy, K.; Mitra, I., On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design. Combinatorial Chemistry & High Throughput Screening 14, (6), 450-474.

Java External Library Used
Apache POI, the Java API for Microsoft Documents. Available at http://poi.apache.org/

XMLBeans. Available at http://xmlbeans.apache.org/

Disclaimer
For academic purposes only. The program AD-MDI has been developed in the Java language and is platform independent. The software is validated on known data sets. Please report any discrepancy in the results for any other dataset.

Contact us at any of the following addresses:
Dr. Tomasz Puzyn, NanoBRIDGES Project Coordinator, Faculty of Chemistry, University of Gdansk, Gdansk, Poland 80-952. Email Id: puzi@qsar.eu.org
Dr. Kunal Roy, Drug Theoretics and Cheminformatics Lab., Dept. of Pharmaceutical Technology, Jadavpur University, Kolkata, West Bengal, INDIA-700032. Email Id: kunalroy_in@yahoo.com

Software Developer details:
Pravin Ambure, Research Scholar, Drug Theoretics and Cheminformatics Lab., Dept. of Pharmaceutical Technology, Jadavpur University, Kolkata, West Bengal, INDIA-700032. E-mail Id: ambure.pharmait@gmail.com (*for any queries regarding the tool)