MEANS OF MATHEMATICAL CALCULATIONS IN INFORMATICS


VYSOKÁ ŠKOLA BÁŇSKÁ - TECHNICKÁ UNIVERZITA OSTRAVA
FAKULTA METALURGIE A MATERIÁLOVÉHO INŽENÝRSTVÍ
MEANS OF MATHEMATICAL CALCULATIONS IN INFORMATICS
Study Support
Jiří DAVID
Ostrava 2015

Title: MEANS OF MATHEMATICAL CALCULATIONS IN INFORMATICS
Code:
Author: Jiří DAVID
Edition: first, 2015
Number of pages: 50
Academic materials for the Economics and Management of Industrial Systems study programme at the Faculty of Metallurgy and Materials Engineering. Proofreading has not been performed.
Execution: VŠB - Technical University of Ostrava

List of contents:
1. INTRODUCTION
2. SOURCES OF ERRORS AND STABILITY IN NUMERICAL CALCULATIONS OF CONSIDERED PROBLEMS
3. ROLE OF OPTIMIZATION
   Current classification of modern optimization methods
4. GENETIC ALGORITHMS
   Basic terminology, objects and operations of GA
   Entire algorithm in brief
   String selection
   Crossover
   Mutation
   Intervening crossover
   Types of genetic algorithms
5. OTHER EVOLUTIONARY ALGORITHMS
   Evolutionary strategy
   Differential evolution
   Self-organizing migrating algorithm SOMA
   Comparison with genetic algorithms
   Artificial immune system
6. DATA WAREHOUSES - SOURCE OF KNOWLEDGE AND INFORMATION
   Why to invest into creation of data warehouses?
   Methodology of CRISP-DM
   Data warehouses in continuous production
   Classification of data mining tasks
   Decision trees

4 BASIC INSTRUCTIONS This is the learning material prepared for students of the distance form of study. It has been issued for the subject No Means of Mathematical Calculations in Informatics that is taught within the first winter semester of the two-year Master s degree Programme called Automation and Computing in Information Technology. This subjects has no prerequisites. PREREQUISITES COURSE OBJECTIVES AND OUTPUTS The main objective of this subject is to inform students about the basic sources of errors in numerical calculations as well as about the classification of modern optimization methods used in informatics. Strong emphasis is placed on explanation and thorough practice of students theoretical and practical skills to solve tasks using genetic, or more precisely evolutionary algorithms. Furthermore, this learning material will present basic theoretical knowledge of data mining process, and it will teach students how to use practical skills when solving data mining tasks in real situations, as well as how to make use of acquired proficiency when solving mathematical tasks in Matlab and tasks using genetic algorithms in Matlab Toolbox Genetic Algorithm. TO PASS THE COURSE STUDENT SHOULD BE ABLE TO DO THE FOLLOWING: Knowledge outputs: Student will be able to define the sources and types of errors in numerical calculations. Student will adopt types and principles of modern optimization methods. Student will acquire comprehensive overview of optimization task solutions by using modern methods based on artificial intelligence. Proficiency outputs: Student will be able to suggest a solution of optimization task using genetic algorithms, or more precisely some of the evolutionary algorithm methods. Student will adopt basic principles of data mining methods. Student will acquire basic skills of solving mathematical tasks in Matlab and solving tasks by using genetic algorithms in Matlab Toolbox Genetic Algorithm. TO STUDY EACH CHAPTER WE RECOMMEND THE FOLLOWING PROCEDURE: To read to understand to be able to explain to apply acquired knowledge within the whole master s degree programme. HOW TO COMMUNICATE WITH THE TUTOR: You can contact your tutor personally during regular tutorials or come individually within tutor s office hours, but only after previous correspondence or phone call. Within this subject students are also required to take part and prepare a semester project on a given topic. 3

5 CONSULTATIONS WITH THE SUBJECT GUARANTOR OR TUTOR ARE POSSIBLE: during regular tutorials personally within tutor s office hours, but only after previous correspondence or phone call GUARANTOR OF THE SUBJECT: doc. Ing. Jiří David, Ph.D. TUTOR: doc. Ing. Jiří David, Ph.D. Tel. No.: j.david@vsb.cz 4

1. INTRODUCTION
Time for studying: 30 min.
Objective
Student will learn the definition and tasks of numerical analysis in mathematics.
Lecture: explanation of the basic terminology
Numerical analysis is the branch of mathematics which deals with the solution of numerically formulated tasks using a finite number of logical and arithmetic operations. Two types of tasks are distinguished:
- Numerically formulated tasks: tasks with an unambiguous functional relationship between finite input and output data. These are mostly algebraic tasks. It is sometimes possible to find the exact solution of such a task by a finite sequence of arithmetic and logical operations, but sometimes it is not, and only an approximate solution can be offered.
- Non-numerically formulated tasks: tasks whose formulation contains an infinitesimally small step (e.g. calculation of a derivative or an integral, solution of differential equations, etc.); these tasks first have to be converted into numerical tasks by other methods.
A numerical method is a procedure for obtaining a numerical solution of a task, or a procedure that converts a task into a simpler one, or a procedure in which a mathematical task is replaced by a numerical one. Numerical methods provide solutions of mathematical problems with arbitrary accuracy, achieved with a finite set of arithmetic and logical operations.
An algorithm is a set of rules to be followed when solving a numerical problem; more precisely, a finite set of unambiguous instructions that, given some set of initial conditions, can be performed in a prescribed sequence to achieve a certain goal and that has a recognizable set of end conditions.
Summary of the terminology
Numerical analysis in mathematics, algorithm, types of tasks.

7 Questions Explain the concept of numerical analysis in mathematics. Define the concept of algorithm and specify its features. Specify the types of mathematical tasks. 6

8 2. SOURCES OF ERRORS AND STABILITY IN NUMERICAL CALCULATIONS OF CONSIDERED PROBLEMS Time for studying 90 min. Objective Student will learn basic sources of errors, terminology like task conditionality, correctness and numerical stability. Lecture: explanation of the basic terminology Sources and types of errors Errors in input data - Inaccuracy of measuring devices - Inaccurate mathematical model Errors caused by a numerical method which has been applied (Truncation errors) - Method errors Round off errors as a result of rounding-off process in calculations using finite numbers Round off errors Data entry in computer Spreading of errors in calculations Relationship of an error character to the size of step h 7

The occurrence of round-off errors for small h has various reasons: round-off errors appear as a result of subtraction of approximately equal numbers in numerical differentiation, or as a result of a long sequence of operations in numerical integration.
Task correctness and conditionality
Task correctness
Definition: Let the task be to find the solution y from N (N is the set of possible solutions) for a given vector x from M (M is the set of input data). The task is correct only if the two following conditions are fulfilled:
1. For every x from M there exists exactly one solution y from N.
2. The solution depends continuously on the input data, i.e. if for all n from the set of natural numbers y_n is the solution for input data x_n, and y is the solution for input data x, and if ρ_M is the norm in the set of input data and ρ_N the norm in the set of possible solutions, then the following relationship is valid:
   ρ_M(x_n, x) → 0  implies  ρ_N(y_n, y) → 0
Task conditionality
Definition: The task conditionality C_p is determined by the ratio of the relative change of the result to the relative change of the input data, i.e.:
   C_p = |Δy / y| / |Δx / x|
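To make both notions concrete, the following short sketch (Python is used here purely for illustration; the helper function, the chosen test tasks and the relative change used are assumptions of this example, not part of the study text) shows the round-off error caused by subtraction of nearly equal numbers and estimates the conditionality C_p of a simple task directly from the definition above.

    # Round-off error (cancellation) and a numerical estimate of the conditionality C_p.
    import math

    # 1) Subtraction of approximately equal numbers: f(x) = sqrt(x + 1) - sqrt(x).
    #    The mathematically equivalent form 1 / (sqrt(x + 1) + sqrt(x)) avoids the
    #    cancellation and keeps almost full accuracy.
    x = 1.0e12
    naive  = math.sqrt(x + 1.0) - math.sqrt(x)        # several significant digits are lost
    stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))
    print(naive, stable)                              # the results differ in the later digits

    # 2) C_p = |relative change of the result| / |relative change of the input data|
    #    estimated for the task y = f(x) at a chosen point x0.
    def conditionality(f, x0, rel_change=1e-6):
        dx = x0 * rel_change                  # small relative change of the input
        dy = f(x0 + dx) - f(x0)               # corresponding change of the result
        return abs(dy / f(x0)) / abs(dx / x0)

    print(conditionality(math.sin, 0.1))      # close to 1: well conditioned
    print(conditionality(math.sin, 3.141592)) # of the order 10^6: ill conditioned near the root

Running the sketch shows C_p close to 1 for sin(x) near x = 0.1 and a value of the order of a million near the root of sin(x), i.e. an ill-conditioned task in the sense defined above.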

If C_p ≈ 1, we say that the task is well conditioned; if C_p > 100, the task is ill conditioned. If the accuracy of the number type used is ε_r(x), then a task with C_p > 1/ε_r(x) cannot be solved within the predefined accuracy. Specific methods are often used for ill-conditioned tasks to limit the growth of round-off errors.
Numerical stability
The use of an unstable method (algorithm) causes a dramatic decrease of accuracy of the numerical solution of the task due to the accumulation of relatively insignificant errors in every step of the calculation. With stable methods the output error grows at most linearly with the number of steps N (in the ideal but rare situation where the sign of the error is random, the round-off error grows only as ~ √N). With unstable methods the round-off error grows faster, e.g. as a geometric sequence ~ q^N, where q > 1. Instability of an algorithm is caused by the accumulation of round-off errors. Instability of a specific mathematical method can also be caused by the accumulation of the method error. The stability of a method can, moreover, depend on the size of the step h used. Method instability often appears in numerical solutions of ordinary and partial differential equations.
Summary of the terminology
Sources of errors, task conditionality, correctness and numerical stability.
Questions
Specify the basic sources of errors in solutions obtained by numerical methods. Characterize the specific sources of errors. Explain the terminology: task conditionality, correctness and numerical stability.

3. ROLE OF OPTIMIZATION
Time for studying: 2 hours
Objective
Student will learn new facts about the role and methods of optimization.
Lesson: explanation of the basic terminology
3.1. CURRENT CLASSIFICATION OF MODERN OPTIMIZATION METHODS
Optimization algorithms are used for finding extreme values of a given objective function by searching for the optimum numerical combination of its parameters. These algorithms can be classified according to the principles of their operation, see Table 1 below. This classification is not the only possible one; nevertheless, since it describes the current general situation very well, it can be considered one possible view of both classic and modern optimization methods.
Table 1 Classification of global search and optimization methods
- Enumerative
- Deterministic: Greedy, Hill-Climbing, Branch & Bound, Depth-First, Best-First, Calculus based, Mathematical Programming
- Stochastic: Evolutionary Computation (evolutionary strategy), Taboo Search (restricted search), Stochastic Hill-Climbing, Random Search-Walk, Monte Carlo, Simulated Annealing
- Mixed: Scatter Search & Path Relinking, Ant Colony Optimization, Immune System Methods, Particle Swarm, Genetic Algorithms, Differential Evolution, SOMA (Self-Organizing Migrating Algorithm)

12 Enumerative methods are used for calculations of all possible combinations of a given problem. This enumerative approach is suitable for problems in which the parameters of objective function are of discrete character, and they acquire small number of values. Deterministic methods are based only on rigorous methods of classic mathematics. First of all it is necessary to enter preliminary prerequisites, which enables us to obtain effective results of this method. Predefined prerequisites for solution of the function by deterministic method are usually the following ones: - problem is linear - problem is convex - search space of possible solutions is limited - search space of possible solutions is continuous - objective function is, if possible, unimodal (it has only one extreme) - there are no non-linear interactions between parameters inside the objective function - information of gradient type, etc., is available - problem is designated in analytical form Stochastic methods are based on the use of randomness. Basically, they are considered as purely random search of values of objective function with the aim that the result is always the best solution found during the entire random search. These methods are slow, suitable only for a limited number of solutions and for rough estimates.. Mixed methods a refined mixture of deterministic and stochastic methods, which in mutual cooperation, shows surprisingly good results. The above mentioned evolutionary algorithms are considered as a strong subset of these algorithms. Algorithms of mixed character are: - robust (they don t depend on initial conditions, and they very often find satisfying solution which is represented by one or more global extremes) - effective and efficient (they are able to find very good solution in a relatively small number of evaluation steps of objective function) - different from stochastic methods (they combine deterministic methods) - they have minimum (or none) requirements on preliminary information - they are able to process the problems of black box type (they don t need analytical formulation of a problem) - they are able to find more solutions once they are activated. 11
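As a minimal illustration of a purely stochastic method from Table 1, the following sketch implements a blind random search: candidate solutions are sampled uniformly from the bounded search space and the best value of the objective function found so far is kept. The code (Python) and the chosen test function are assumptions of this text, not an algorithm prescribed by the course.

    # Blind random search: the simplest stochastic global-optimization method.
    import random

    def random_search(objective, bounds, evaluations=10_000):
        """Minimize `objective` over a box given as [(low, high), ...]."""
        best_x, best_f = None, float("inf")
        for _ in range(evaluations):
            x = [random.uniform(low, high) for (low, high) in bounds]
            f = objective(x)
            if f < best_f:                      # remember the best solution found so far
                best_x, best_f = x, f
        return best_x, best_f

    # Example: sphere (1st De Jong) function in 5 dimensions, minimum 0 at the origin.
    sphere = lambda x: sum(v * v for v in x)
    print(random_search(sphere, bounds=[(-5.12, 5.12)] * 5))

Even on this simple function the result after ten thousand evaluations is only a rough estimate of the optimum, which illustrates why pure stochastic methods are described above as slow and suitable mainly for rough estimates.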

13 Summary of the terminology methods. Classification of global search and optimization methods, deterministic, stochastic and mixed Questions Classify methods of global search and optimization, and give examples. Characterize deterministic methods. Characterize stochastic methods. Characterize mixed methods. What kind of a method are genetic algorithms? 12

14 4. GENETIC ALGORITHMS Time for studying 7 hours Objective types. Student will learn basic principles of genetic algorithms, as well as their basic operations and Lecture: explanation of the basic terminology Genetic algorithms (GA) are considered as the most important and most frequently used representatives of evolutionary computation techniques with diverse applications BASIC TERMINOLOGY, OBJECTS AND OPERATIONS OF GA Basic objects in GA are string, gene and population. Basic operations, which work with these three objects, are crossover, mutation and selection. Gene is a basic structural unit of a string. We can understand the gene as a part of the string that represents fundamental characteristics of an individual. If it is not explicitly stated so the concept of gene means individual fundamental numerical or symbol units of the string. Gene is sometimes called a bit. Population is a group of selected set of strings. From the mathematical point of view population can be understood as two-dimensional space, or rather a matrix of numbers and/or symbols. The size of population is constant in most applications. Example of integer population: P = [ ] Generation is population in the specific calculation phase of GA, or its ordinal number (also algorithm calculation cycle). String (chromosome) is a sequence of numerical or symbol values representing selected characteristics or parameters of an individual (phenotype) from a specific domain 13

(from the domain of real physical objects). Strings can be binary, integer, real-number, symbolic, or combined.
Examples of strings:
binary string r = [ ]
integer string r = [2, 7, 21, 0, 105]
real-number string r = [7.1, 0.01, 128.0, -1.5]
symbolic string r = [positive, small, H2SO4]
combined string r = [2.77, -2, X, alkaline, off]
The method of encoding used in an algorithm depends on the nature of the problem being solved. In the majority of applications where numerical values are used, some authors prefer binary encoding even in cases when the optimized variables are integers or real numbers. The disadvantage of binary strings is not only the considerable growth of the string length for the majority of variables, but also the limited accuracy of calculations, which is determined by the number of bits encoding one number. The number of bits D necessary for encoding one variable can be calculated from the relationship
   D = ⌈log2((h - d) / ε)⌉
where h is the upper and d the lower boundary of the search space, ε is the required accuracy and ⌈ ⌉ denotes rounding up to an integer. In discussions about which encoding is better, binary or real-number, the majority of empirical comparisons show that real-number encoding gives slightly better results.
Fitness value
In the literature we can find terms like fitness, CV (cost value), viability, quality of an individual, etc. Fitness is a frequently used concept in the terminology of ES, representing the rate of suitability of agents (the word fit comes from English). The best fitness in the case of maximization is the highest value of the objective function; in the case of minimization it is, on the other hand, the lowest value. Some authors prefer the convention that the fitness value is maximized by the GA. To comply with the basic principle of evolution, i.e. to prefer the strongest individuals for the next generation, or more precisely for the crossover producing the next generation, the objective function is, when its minimum is searched for, transformed into the so-called fitness function whose maximum is then sought. For instance, in the case of minimization of the objective function the fitness function can be defined as follows: fitness = 1/(1 + objective function). After this transformation the lower (e.g. negative) values of the objective function obtain adequately higher fitness values, so the lower objective values are prioritized when the fitness function is maximized.

For better understanding, the following examples show the conversion of objective function values into fitness values:
0.5 => 1/(1 + 0.5) ≈ 0.67
0.3 => 1/(1 + 0.3) ≈ 0.77
0 => 1/(1 + 0) = 1
-0.5 => 1/(1 + (-0.5)) = 2
The objective function assigns a value to every string in the population. It represents the core of the optimization problem, and the task is to find the global extreme of the objective function. The objective function is the measure of what we want to maximize (efficiency, production, profit, etc.) or, on the other hand, minimize (errors, losses, consumption, etc.). The evaluation of the objective function is very often a computationally demanding procedure, sometimes consisting of several criteria (multi-objective optimization task).
Opposite problem
An opposite problem is the problem of the opposite extreme than the one the algorithm has been designed for. It occurs when we have a method based on procedures that search for a minimum, but for some reason we need to find a global maximum, and vice versa. This requirement can be met very easily and without any intervention into the source code of the method by simple multiplication of the objective function by -1. However, it is very important to be aware of the interpretation of the result: the minimum of the reversed function lies at the same point as the maximum of the original function. From the result obtained by the method we therefore take only its x-part (the position of the extreme); the corresponding y-value must be recalculated from the original objective function.
The algorithm only asks about the quality of a string. Thus a function f(x) must exist which returns the quality of a selected string. This function is our well-known objective function. In the example in Table 2 the objective function is only a sum of parameters; however, it could be any function that combines the parameters. It is not always true that the agent with the fittest parameters is the best; this issue is quite problem-dependent.
Encoding - discretization of the space
The mechanism which allocates an element of the search space to each string is called problem encoding. Encoding is the allocation of exactly one point of the search space to an agent. Thus each string represents one element from the selected interval. This occurs even if the population is generated randomly. When we have a generation with, e.g., 8-position-long strings, these strings represent individual points of an interval whose boundaries can be selected. To obtain the number which an individual represents, and which is then substituted into the objective function, we have to carry out the so-called string decoding.
Relationship of the objective function and the fitness function
The objective function evaluates how well the agents in the population fit the requirements of the generally predefined objective of the solution in the search space. In the case of

minimization the fittest agents achieve the lowest values of the objective function; in the case of maximization, on the other hand, the highest values. In GA the success rate, or fitness, of agents is often expressed by the fitness function which, according to some authors, is only maximized and directly influences the probability with which an agent proceeds to reproduction, or more precisely to the creation of a new population. Therefore in this case it is necessary to convert the values of the objective function into the fitness function:
   F(x) = g[J(x)]
where F is the fitness function, J is the objective function and g is an appropriate conversion into non-negative scalar values. One of the possible ways is the linear conversion
   F(x) = a·J(x) + b
where a is a positive number when the objective function is being maximized, or a negative number when the objective function is being minimized, and b is a shift chosen so that the values of the fitness function are non-negative. Sometimes a conversion is used which allocates to the agents fitness values proportionate to their relative fitness within the whole population, according to the relationship
   F(x_i) = J(x_i) / ( (1/n) · Σ_{i=1..n} J(x_i) )
where n is the number of strings in the population and x_i is the i-th string. Some authors do not consider the conversion into a fitness function at all; in this case the objective function J itself is maximized or minimized.
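The encoding and fitness ideas above can be summarized in a short illustrative sketch (Python; the helper names bits_needed, decode and fitness are assumptions of this example, not part of the study text). It computes the number of bits D for a given interval and accuracy, decodes a binary string back into a real number, and applies the transformation fitness = 1/(1 + objective) used in the numerical examples above.

    # Binary encoding of one real parameter and the fitness transformation
    # fitness = 1 / (1 + objective) used when a minimum is searched for.
    import math

    def bits_needed(d, h, eps):
        """Number of bits D so that the interval <d, h> is resolved with accuracy eps."""
        return math.ceil(math.log2((h - d) / eps))

    def decode(bits, d, h):
        """Map a binary string (list of 0/1 genes) back to a real number from <d, h>."""
        value = int("".join(str(b) for b in bits), 2)
        return d + value * (h - d) / (2 ** len(bits) - 1)

    def fitness(objective_value):
        """Transform a minimized objective value into a maximized fitness value."""
        return 1.0 / (1.0 + objective_value)

    D = bits_needed(d=-1.0, h=2.0, eps=0.001)          # 12 bits for the interval <-1, 2>
    print(D, decode([1] * D, d=-1.0, h=2.0))           # all ones decode to the upper bound
    print(fitness(0.5), fitness(0.0), fitness(-0.5))   # 0.67, 1.0, 2.0

The three printed fitness values reproduce the conversions 0.5 => 0.67, 0 => 1 and -0.5 => 2 given earlier in this chapter.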

Principle of genetic algorithms
The "classic" structure of a genetic algorithm is shown in Fig. 4: initialization of the population, evaluation of fitness, test whether a satisfactory solution has been found, selection of the fittest agents, selection of pairs of individuals for reproduction, crossover and mutation, and addition of the offspring to the new population; the cycle is repeated until the completion condition is met.
Fig. 4 Classic structure of genetic algorithm
Before the genetic algorithm itself starts to work, it is necessary to carry out the following steps:
1. To define the way of encoding the parameters of the optimized objects into a linear string. This means selecting the type of values (binary, real-number, ...) and the order of the individual parameters of the phenotype in the string. The order of parameters is not a decisive factor, but it is necessary to keep this order throughout the whole process.
2. To mark the boundaries of the search space of solutions (to define the allowable intervals of values for each gene of the string). The more precisely the boundaries of the search space are marked, the faster the solution is found. Suitable demarcation of the space can dramatically decrease the time needed for finding the solution.
3. To formulate the objective function, or more precisely the way of fitness evaluation.

4. To select the population size. The population size can depend on the specific case. In the majority of tasks it is, however, recommended to select a size between 10 and 700, usually between 20 and 50. In too small populations there is not enough room for diversity of the genetic information; on the other hand, excessively big populations do not show better results, so the time needed for finding the solution is only extended.
4.2. ENTIRE ALGORITHM IN BRIEF
The entire algorithm can be described as a sequence of the following operations:
0. Initialization of the zero generation (Main generation); until the terminating condition of the whole genetic algorithm is fulfilled, repeat steps 1 to 5:
1. Appraisal of each agent in the Main generation;
2. Selection of the fittest agents and their duplication into the Assisting generation;
3. Selection of two suitable parents from the Assisting generation; if the condition of parent crossover is fulfilled, then:
4. Crossover of the pair of parents, birth of two offspring and their duplication into the Main generation; if the condition of mutation is fulfilled, then:
5. Random mutation of single agents in the Main generation.
Termination of genetic algorithms
The problem of terminating the whole genetic algorithm is usually solved by bounding the number of generations from above by a number p_max (the maximum number of algorithm repetitions). Several methods of terminating genetic algorithms exist [5]. To illustrate this issue, one of them is described here. It is based on the fact that in each iteration step (generation) we find the chromosome with the highest fitness value and label it α_opt. Let 0 ≤ w(α_opt) ≤ 1 be the quotient of chromosomes in the population whose fitness value is equal to the fitness value of the agent α_opt. The genetic algorithm can then be terminated when w(α_opt) > w_max, where w_max is a limit value of the quotient (usually 0.8 to 0.95). This means that the population includes w(α_opt)·100 % of agents identical with the chromosome α_opt, so there is only a small probability that agents with even higher fitness values will be born in the next generations.
Terminating conditions of algorithms:
The evaluation of terminating conditions of a GA can be carried out in one of the following ways:
a) The first way is to test how the predefined conditions of an expected solution are being fulfilled. If some string meets one of these predefined conditions, the operation of the genetic algorithm is terminated and this string is considered the solution of the GA.
b) When searching for a global extreme we often, for various reasons, do not know how to formulate suitable terminating conditions. A terminating condition

20 can be even a situation when the objective function, or more precisely its value, is not changed for a longer time (in defined number of generations). This situation has one drawback. And it is that this situation in GA can mean only temporary stagnation of the solution in local extreme. But GA can generally leave this local extreme after some time. c) Similar possible way of terminating the solution is a situation when single strings in population are very similar to the fittest agent, possibly they are identical with it. Then it is valid that the scope of values of corresponding genes in single strings of population is small. In this case we can t expect more from this situation. Solution can also be only local extreme in this case. d) The most frequent way of terminating the solution is to predefine the requirement of how many times the solution should be generated. We can easily estimate this number of solutions after several repeated initializations of a given solution. Similar alternative is to interrupt GA operation and always decide whether the actual solution is suitable from the point of view of user, or if it is necessary to continue in the process. i. Stagnation Stagnation is a phenomenon which occurs when the value of objective function is stopped towards the lower values yet before reaching the global extreme. This phenomenon occurs without any clear reasons. It differs from early convergence towards suboptimal solution in the fact that it keeps the population constantly diversifiable all the time even if the optimization process doesn t continue any more. It is generally known that use of evolutionary algorithms, or algorithms related to them, can cause the early convergence towards the suboptimal solution only if: - optimization process has brought population into the local extreme of a specific objective function - population has lost its diversity, or - optimization process runs slowly or doesn t run at all Due to this fact optimization process runs slowly or doesn t run at all STRING SELECTION String selection is a procedure which on the basis of a selected strategy chooses from population in specific generation not only some strings that then participate in the process of crossover and mutation, but even also strings that without any change duplicate themselves into the population of a new generation. There are more ways of selecting the strings, but the most widely used are tournament selection and roulette wheel selection. For majority of selection methods it is valid that the fittest strings have higher probability to be selected into the next population than less fit strings, i.e. better agents from population gradually replace worse ones. Roulette wheel selection Let s imagine this roulette wheel selection as a classic roulette game but only with one difference. In classic roulette game we have for each number that can be drawn the same part of the wheel indicating a number the ball lands on, thus each number can be selected with the same probability as e.g. adjacent number. In roulette wheel selection the probability of 19

selection is determined by the quality of the string. The wheel, which turns around its centre, is divided into as many proportions as there are strings in the population. The size of each wheel proportion can be defined in several ways. The probability of selecting the i-th string from a population of size N by this method can be expressed by the relationship
   P(i) = f(x_i) / Σ_{j=1..N} f(x_j),   i = 1, 2, ..., N
where f(x) is the fitness (or more precisely a non-negative value of the objective function for maximization). Percentage points are obtained by multiplying this value by 100.
Tournament selection
This selection process can be described easily. We randomly select two different agents from the actual generation, and only the fitter one of this couple passes into the new generation. If one of these two agents is the fittest one, it is promoted into the next generation with high probability. On the other hand, the less fit string has almost no chance to pass into the next generation. The main advantage of this selection method is its speed. When using this method we follow the algorithm described below:
1. A couple of strings (possibly even a larger group) is randomly selected from the whole population.
2. The agent having the better (best) fitness value is duplicated into the group of selected strings.
3. Unless the required number of strings has been selected, we continue with point 1.
Let us point out that agents which have participated in tournament selection are not disqualified; each string can be selected several times.
Stochastic uniform sampling
When comparing the above-mentioned methods, roulette wheel selection has one disadvantage which shows up when the number of selected strings is relatively small: the expected statistical characteristics of this method are demonstrated only for a high number of selected strings. This disadvantage can be eliminated by the following modification. Instead of one selection indicator we deploy on the roulette wheel perimeter, at equal intervals, as many selection indicators as the number of strings we would like to select. Then we turn the roulette only once and select the strings marked by the indicators. According to the situation in Fig. 7 we select 8 strings with ordinal numbers 1, 2, 3, 3, 5, 5, 7, 8.
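The two probabilistic selection schemes just described can be sketched as follows (Python; the function names and the example fitness values are assumptions of this illustration, not part of the study text). Roulette wheel selection spins the wheel once per selected string, while stochastic uniform sampling places all indicators on the wheel and turns it only once.

    # Roulette wheel selection and stochastic uniform sampling (SUS) over one
    # population; `fitnesses` must be non-negative, as required in the text.
    import random

    def roulette(fitnesses, count):
        total = sum(fitnesses)
        wheel, acc = [], 0.0
        for f in fitnesses:                    # cumulative wheel proportions
            acc += f / total
            wheel.append(acc)
        wheel[-1] = 1.0                        # guard against floating-point round-off
        chosen = []
        for _ in range(count):                 # one independent spin per selected string
            r = random.random()
            chosen.append(next(i for i, edge in enumerate(wheel) if r <= edge))
        return chosen

    def stochastic_uniform(fitnesses, count):
        total = sum(fitnesses)
        step = total / count                   # indicators placed at equal intervals
        start = random.random() * step         # the wheel is turned only once
        points = [start + k * step for k in range(count)]
        chosen, acc, i = [], fitnesses[0], 0
        for p in points:
            while p > acc and i < len(fitnesses) - 1:
                i += 1
                acc += fitnesses[i]
            chosen.append(i)
        return chosen

    fit = [8, 4, 6, 1, 5, 2, 3, 7]
    print(roulette(fit, 8))
    print(stochastic_uniform(fit, 8))          # indices of the 8 selected strings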

Fig. 7 Basic description of stochastic uniform selection (sampling)
Success rate selection
The previous methods are based on a probability selection model. The following method represents a deterministic way in which the selection is strictly defined by the user. As a rule we must define how many duplicates of the individual fittest strings are to be selected from the population: a defined pattern of the selection method may, for example, select the fittest string three times, the second fittest string two times and the third fittest string once. When this selection method is implemented the algorithms converge quickly, but in most cases only to the nearest local extreme, where they tend to stay for many generations. It is necessary to point out that the ignored weaker agents can carry useful information which might help to find the global extreme.
Random selection
From the n strings in the population, m strings are selected at random, and the same string can be selected several times. In this case each string in the population has the same chance to be selected regardless of its fitness value. This selection method is, however, not used as the only selection method within a specific algorithm, but only in combination with the others (e.g. with the previously mentioned methods), because it does not guarantee one important property: that fitter strings have a better chance to survive than less fit ones.
Diversity rate selection
In many cases it is very useful to select strings which have the most diverse characteristics in comparison with the majority of the other strings in the population, or more precisely with regard to the fittest strings in the population. These agents bring new genetic information into the new population and can initiate new directions of the search for the global extreme. This is appropriate especially when the solution stagnates for a long time in the vicinity of the global extreme. One of the selection methods is a search for the most diverse strings, i.e. strings that have the largest Euclidean distance with regard to a reference string.

   e = √( Σ_{i=1..p} (r_i - s_i)² ) → max
where r_i are the genes of the reference string, s_i are the genes of the compared string and p is the number of genes of each string in the population. It is suitable to take as the reference string the best string, or possibly the string obtained by averaging the genes across the actual population.
Elitist selection
The string with the best fitness value, or possibly several of the fittest strings of the actual generation, passes without any modification into the next generation. This approach is called elitism.
Importance of an individual's age in the selection process
In biological evolution one of the most important factors influencing the life of a population is the age of its individuals. Older individuals are gradually replaced by younger ones regardless of their fitness value. Here the age is equivalent to the number of generations during which the specific agent has existed in the population in unchanged form. Thanks to this, the diversity in the population increases and the tendency of the population to stagnate in a local extreme is reduced.
CROSSOVER
Crossover is an operation during which two parent strings are split at a randomly selected position (or positions) that is the same for both strings, and mutually exchange one of the corresponding parts (alternately every second part) of the string. Thanks to the crossover mechanism, new, different offspring which carry some features of their parents are produced. The entire mechanism of crossover starts when two strings are selected from the newly defined population. These two strings can be understood as two parents; their offspring then carry parts of their parents' genetic information, and each of the offspring carries only a piece of information from its predecessors. In the basic version of crossover, two new strings (offspring) are again produced from two parent strings. The parent strings are split at one position (single-point crossover) or at more positions (multi-point crossover), and they mutually exchange the corresponding parts of their strings (see Fig. 9). The positions where the strings split are selected randomly.
Single-point crossover of a real-number string

Principle of multi-point crossover
Single-point crossover of a binary string with two eight-bit numbers
Fig. 9 Geometric interpretation of crossover of real-number strings with two genes: parents Ra = [r_a1, r_a2], Rb = [r_b1, r_b2]; offspring P1 = [r_a1, r_b2], P2 = [r_b1, r_a2]
In the entire population, crossover can occur between two (or n) adjacent strings, or randomly between any two strings forming couples. A GA is usually designed so that between 75 % and 95 % of all strings are crossed over in one generation. This number is sometimes called the probability of crossover in the GA (p_k). The other strings pass unchanged into the next generation.
MUTATION
The last step in the production of a new generation is mutation. Like crossover, mutation occurs only with a certain probability. This probability is considerably smaller than for crossover (e.g. 0.01). The whole operation is carried out as follows: we take the strings one by one from the population and examine each string step by step at its individual positions (zeros and ones). According to the value of the mutation probability we change the value at individual positions of the string: for instance, if there is the value 1 at the position, we change it to 0, and vice versa. Mutation helps in situations when all the strings are very similar to one another.
Mutation is an operation during which a randomly selected gene (or more genes) of a random string (or more strings) in the population changes its value to another random value

from the limited scope of values in the search space. Mutation enables finding new solutions which have not appeared in the population yet. Mutation is the basic driving force of GA, or more precisely of ES.
   R = [g_1, g_2, g_3, ..., g_{i-1}, g_i, g_{i+1}, ..., g_n]
   P = [g_1, g_2, g_3, ..., g_{i-1}, g'_i, g_{i+1}, ..., g_n],   g'_i ∈ (g_{i,min}; g_{i,max})
where R is the parent string, P is the offspring, g_i are the individual genes and g'_i is the mutated gene. Based on the nature of the considered problem it is possible to carry out mutation in several ways:
a) Uniform mutation: the value of the selected gene is replaced with a uniform random value from its allowed scope (g_{i,min}; g_{i,max}), where i is the ordinal number of the gene in the string. In binary encoding of strings the value of a bit is changed into its complementary value.
b) Additive mutation: can be used only in integer or real-number encoding. The value of the selected gene is increased by a random number δ from the defined scope:
   g'_i = g_i + δ,   δ ∈ (δ_min; δ_max)
c) Multiplicative mutation: is also applicable only in integer or real-number encoding. The value of the selected gene is multiplied by a random number δ from the defined scope:
   g'_i = g_i · δ,   δ ∈ (δ_min; δ_max)
The mutation probability of one gene out of all the genes in the entire population is usually in the scope from 0.0001 to 0.1 (i.e. 0.01 % to 10 %), according to the type of the method applied.
INTERVENING CROSSOVER
A unique operation used especially in numerical types of tasks (optimization in continuous space) is a specific combination of crossover and mutation, the intervening crossover, which was adopted from evolutionary strategies. It can be used only for real-number encoding of strings. Let us assume two parent strings
   g_a = [g_a,1, g_a,2, ..., g_a,n]^T   and   g_b = [g_b,1, g_b,2, ..., g_b,n]^T
A new offspring is produced by the operation
   p = g_a + Λ·(g_b - g_a)
where Λ is a matrix of size n x n which has random numbers from the scope 0.25 to 1.25 on its main diagonal, and the numbers outside the main diagonal are all zero.
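A brief sketch of the mutation variants a) to c) and of the intervening crossover p = g_a + Λ·(g_b - g_a) may make the operations easier to follow (Python; the diagonal matrix Λ is represented simply by one random coefficient per gene, the coefficient bounds 0.25 to 1.25 follow the text above, and the sample parents are illustrative).

    # The three mutation variants from the text and the intervening crossover.
    import random

    def uniform_mutation(gene, low, high):
        return random.uniform(low, high)                 # value replaced inside its scope

    def additive_mutation(gene, delta_min, delta_max):
        return gene + random.uniform(delta_min, delta_max)

    def multiplicative_mutation(gene, delta_min, delta_max):
        return gene * random.uniform(delta_min, delta_max)

    def intervening_crossover(g_a, g_b, low=0.25, high=1.25):
        # Lambda is diagonal, so each coordinate gets its own random coefficient.
        return [a + random.uniform(low, high) * (b - a) for a, b in zip(g_a, g_b)]

    parent_a, parent_b = [2.0, 7.0], [6.0, 9.0]
    print(intervening_crossover(parent_a, parent_b))     # one offspring along the parents' line
    print(additive_mutation(1.5, -0.1, 0.1))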

The geometric interpretation of this operation is shown in Fig. 11.
Fig. 11 Geometric interpretation of intervening crossover: parents Ra = [g_a1, g_a2], Rb = [g_b1, g_b2]; offspring P1, P2, P3, P4
4.8. TYPES OF GENETIC ALGORITHMS
Hybrid GA
A large group of genetic algorithms is formed by the so-called hybrid GA, in which other well-known methods are combined with GA. For instance, with the help of another, non-genetic optimization method (e.g. simulated annealing) we generate the initial population and then with the help of GA we try to find an even better solution. Another type is a hierarchic structure of GA, in which groups of non-genetic methods run in parallel on the lowest level of the GA.
Parallel GA
The population method used for the search of the state space has parallelism embedded in its basics. This parallelism can be exploited in calculations carried out with appropriate technical equipment and relevant programming tools. If we solve very difficult and time-consuming optimization problems, it is beneficial to use parallel GA. A substantial improvement of the classic GA is achieved by implementation of a migration operator which accelerates the process of finding the global extreme. The basis of parallel GA is the following: the task is solved by GA independently in several small populations, from which some (e.g. the fittest) agents pass on their genetic information to the main population. By doing this the fittest agents create a rich gene pool which converges to the target solution considerably faster.

27 Multilevel distributed GA With regard to the fact that operation of an algorithm depends on its initial population, and that GA behaviour is strongly influenced by a series of implementation details, the socalled multilevel distributed GA has been implemented. This method can be described as follows: several GA are initiated at the same time (with diverse initial population with different set of parameters, and with different implementations). The moment when algorithm reaches hierarchically highest level can be regarded as the final solution of the entire operation. In multilevel distributed algorithms, subsidiary algorithms are successfully restarted so many times until there is no improvement of a solution after a certain predefined number of iteration steps, or they do not offer GA of higher structure any new solution, or no solution offered by them is accepted. Comparisons of difficulties in terms of time necessary for finding the solution, or the number of iteration steps necessary for finding the solution, between ordinary and distributed GA (the number of iterations is determined in multilevel distributed GA according to the number of iterations carried out by an algorithm on the highest hierarchical level) show significant advantages of this multilevel arrangement. Summary of the terminology Genetic algorithm, gene, population, generation, chromosome, types of chromosomes, selection, crossover, mutation, string encoding, space discretization, selection methods, crossover methods, mutation methods, types of genetic algorithms. Questions Give the definition of genetic algorithm. What tasks is the genetic algorithm used for? Explain the following terminology: gene, population, generation, and chromosome. Describe the principle of string encoding, and explain the importance of space discretization. Describe the diagram of genetic algorithm. What is elitism? Define it. Name the selection methods and characterize them briefly. Name the crossover methods and characterize them briefly. Name the mutation methods and characterize them briefly. Characterize types of genetic algorithms. 26

5. OTHER EVOLUTIONARY ALGORITHMS
Time for studying: 2 hours
Objective
Student will learn new facts about the role and methods of evolutionary algorithms.
Lesson: explanation of the basic terminology
The genesis of other optimization algorithms started in the 1990s. Scientists did not only want to improve the basic operations of evolution (selection, crossover and mutation); since that time they have also been trying to embed these operations into entirely different procedures. As a result, new versions of algorithmic tools have appeared. Furthermore, strict adherence to the analogy with nature has slowly been abandoned; the presence of several (!) parents in the production of one offspring can serve as an ideal example.
5.1. EVOLUTIONARY STRATEGY
The very first version of the evolutionary strategy (ES) considered only one single agent represented by a couple of vectors v = (x, σ), where x was the vector of searched parameters (coordinates of a point in the search space) and σ was the vector of root-mean-square deviations of possible modifications of the components of the vector x. Modifications are implemented by one operator, mutation:
   x^{t+1} = x^t + N(0, σ)      (X3.1)
where N is a vector of independent random numbers with normal probability distribution, zero expected value and root-mean-square deviation σ. The probability density function of such a continuous random variable has the form
   h(x) = (1 / (σ·√(2π))) · e^(-x² / (2σ²))
Since σ influences the width of the Gaussian curve, it can simply be described as the size of the mutation step. The modifications of agents correspond to the fact that in nature small changes occur more often than big ones. Only the fitter one of the couple parent + offspring, i.e. the one with the better value of the objective function (or possibly the one which complies with all the restrictions), passes into the next generation. This version of the algorithm is called (1+1): 1 parent + 1 offspring.

From the beginning, σ was considered a vector of constants. Based on previous statistical observations it was found that the number of mutations leading to an improvement of the objective function should represent approximately 1/5 of all mutations. This led to the following modification of the algorithm: after k generations, during which the success rate φ of mutations is evaluated, the parameter σ is corrected:
   σ^{t+1} = c·σ^t      if φ < 1/5
   σ^{t+1} = σ^t / c    if φ > 1/5
   σ^{t+1} = σ^t        if φ = 1/5
where 0.817 < c < 1 is an empirically obtained constant. This modification causes that if the mutations have been successful, the size of the changes increases, and if they have not been, the size of the changes decreases. Despite this modification the solution of the algorithm often got stuck in a local optimum. That is why the size of the population was increased (as in GA) from 1 to n agents, which brought more parallel directions of search. Furthermore, a crossover operator was implemented. This crossover operator is applied to the components of the vector x and at the same time to the components of the vector σ. By crossover of two parents Xa and Xb an offspring Xp is produced:
   Xa = {x_a, σ_a} = {x_a1, x_a2, ..., x_an, σ_a1, σ_a2, ..., σ_an}
   Xb = {x_b, σ_b} = {x_b1, x_b2, ..., x_bn, σ_b1, σ_b2, ..., σ_bn}
   Xp = {x_p, σ_p} = {x_p1, x_p2, ..., x_pn, σ_p1, σ_p2, ..., σ_pn}
whose components are a combination of the parents Xa and Xb. In comparison with the original method used in GA, the method of crossover was changed into discrete crossover (note: this issue is not included in this learning material) or intervening crossover. These types of crossover later started to be used also in GA. In ES versions with more agents in the population, the deterministic adaptation of the vector σ according to the 1/5 success rule (see the relationships above) stopped being used; moreover, even this part of the string started to mutate, similarly as in (X3.1). In populations with more members, two selection strategies started to be used. First, λ offspring are produced from randomly chosen couples of parents by crossover and mutation (note λ > μ, where λ is the number of offspring and μ the number of parents). In the so-called "(μ+λ)" type, parents and offspring form one group, from which the μ fittest agents are again chosen by a deterministic method. In the "(μ,λ)" type, the μ fittest agents are chosen only from the group of λ offspring. With this selection method the life of one agent is limited to one generation, and the shift of the fittest agent from the preceding generation into the population of the new generation is not guaranteed. On the other hand, this method shows a certain advantage in tasks where the position of the global optimum changes in time, or is noisy.
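The basic (1+1) evolutionary strategy with the 1/5 success rule can be sketched as follows (Python; the values k = 20 generations between adaptations and c = 0.85, taken from the interval 0.817 < c < 1 stated above, as well as the test function, are illustrative choices of this example).

    # A minimal (1 + 1) evolutionary strategy with the 1/5 success rule for sigma.
    import random

    def es_1plus1(objective, x0, sigma=1.0, generations=2000, k=20, c=0.85):
        x = list(x0)
        f_x = objective(x)
        successes = 0
        for g in range(1, generations + 1):
            offspring = [xi + random.gauss(0.0, sigma) for xi in x]   # mutation only
            f_off = objective(offspring)
            if f_off < f_x:                          # minimization: keep the better one
                x, f_x, successes = offspring, f_off, successes + 1
            if g % k == 0:                           # adapt sigma every k generations
                rate = successes / k
                if rate > 0.2:
                    sigma /= c                       # frequent success -> larger steps
                elif rate < 0.2:
                    sigma *= c                       # rare success -> smaller steps
                successes = 0
        return x, f_x, sigma

    sphere = lambda x: sum(v * v for v in x)
    print(es_1plus1(sphere, x0=[3.0, -2.0, 4.0]))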

It is usually recommended to choose the ratio μ:λ approximately 1:7 (e.g. 15 parents to 100 offspring). If we want to speed up the local convergence, we can change this ratio even to the value 1:20 (e.g. 5 parents, 100 offspring).
5.2. DIFFERENTIAL EVOLUTION
Differential evolution is the name of a relatively simple but very efficient optimization algorithm. Although it uses only a few parameters, it is able to solve numerical tasks very efficiently. The structure of differential evolution (DE) is very similar to the structure of genetic algorithms, with which it shares several features, for example the production of offspring (here by 4 parents instead of 2, as in genetic algorithms), the use of generations, etc. This chapter presents the basics of this algorithm, including the differences from GA.
DE works with populations in the same way as other evolutionary algorithms, and the principle of its function and of the parameter management of agents is the same as for other evolutionary algorithms. As in the classic evolution theory, the key role is played by the so-called mutation. DE differs from other methods in the fact that for the production of a new offspring not two parents are needed, but four. For each agent, three other agents are randomly chosen from the population (r1, r2, r3). Using these three agents the so-called noise vector is created. The noise vector is nothing more than a mutation of these three combined parents. More specifically, the mutation is carried out in such a way that the difference of the first two parents is multiplied by the mutation constant F and the resulting vector is added to the third, remaining parent (see the following equation):
   v_j = x^G_{r3,j} + F·(x^G_{r1,j} - x^G_{r2,j})      (3.6)
The method of differential evolution has another peculiarity: in differential evolution crossover takes place after the mutation process, while, for instance, in genetic algorithms an offspring is first produced by crossover and only then is mutation initiated. The process of crossover in DE means that a new agent, called the trial vector, is produced from the fourth (so far unused) parent and the noise (mutated) vector. The trial vector is created using the so-called crossover constant CR. The corresponding parameters of the fourth and the noise agent are taken in a cycle, and a random number is generated for every selected couple; if this random number is smaller than CR, the parameter of the noise agent is copied into the corresponding parameter of the trial vector. A new agent is created this way, and it competes for a place in the new population with the actual fourth agent from the old population. If we want to find an analogy of this method with ordinary life, then the fourth parent is actually the son of the other three parents and at the same time the father of his brother; in this point DE differs from the basic rules of evolution, and from the moral point of view let us not continue with further explanations.
Another difference between DE and GA is the process of testing the terminating conditions. DE is in fact terminated only when the user-defined number of generations has been carried out. There is no other terminating parameter included in the algorithm, but on the

other hand, we cannot ignore the possibility that programmers can be very creative and could come up with other versions of terminating conditions. During every generation the value of the objective function of the fittest agent is saved into a history vector, which after the termination of the process shows the course of the entire evolutionary process.
DE does not have only advantages. One of its disadvantages is the so-called stagnation, which is characteristic for this algorithm. The description and reasons of stagnation of differential evolution do not differ from the same phenomenon in other evolutionary techniques, thus also from the stagnation of genetic algorithms, so we can refer back to the chapter on genetic algorithms. Despite the general facts explained there, the optimization process running under differential evolution shows stagnation under specific conditions in spite of the fact that:
- the population has not converged to a local extreme,
- the population has not lost its diversity,
- new agents are still being produced.
This problem occurs in DE only with certain combinations of the population size and the controlling parameters. These findings are useful for the design of the population and the controlling parameters. The risk of stagnation is in inverse proportion to the number of new possible solutions, i.e. new agents, which the algorithm can generate during one generation. In other words, the bigger the number of possible solutions during one generation, the smaller the risk of stagnation.
Algorithm of differential evolution
Let every agent of the population have the form of a vector of searched parameters x_i = {x_i,1; x_i,2; ...; x_i,n}, where i is its ordinal number in the population. The basis of this algorithm is a specific way of creating the so-called trial agents. These trial agents are newly produced offspring which compete for survival with their parents. The algorithm of DE for the calculation of one generation is the following:
1. For an agent x_i in the actual population of size N, where i is the ordinal number of the agent in the population, a vector v is generated:
   v = x_{r1} + F·(x_{r2} - x_{r3})
where r1, r2, r3 are mutually different random ordinal numbers in the population (from 1 to N), all different from i, and 0 < F < 2 is a constant. Literally taken: we choose one random string from the population, different from x_i, and the difference of the vectors of two other randomly selected strings, multiplied by the number F, is added to this random string.

2. We create the so-called trial string u. For every index j = 1, ..., n:
   u_j = v_j        if r_j ≤ CR
   u_j = x_{i,j}    if r_j > CR
where j is the ordinal number of a component of the vectors u and v, r_j is a random number from the interval (0; 1) and CR is the crossover probability parameter from the same interval. It is actually the creation of a string by discrete crossover of the strings v and x_i.
3. In the last step the i-th agent of the new population is generated (let us consider the case of minimization):
   x_i^{t+1} = u        if f(u) ≤ f(x_i^t)
   x_i^{t+1} = x_i^t    if f(u) > f(x_i^t)
where t is the number of the actual population. This means that the more successful of the couple x_i^t (the original agent) and u (the trial string) passes into the new generation as the i-th agent.
4. The sequence of steps 1 to 3 is repeated for all strings of the actual population (i = 1, 2, ..., N).
In its essence the DE algorithm is a simple procedure, and it requires only the two parameters CR and F, which are determined experimentally. Their values influence the correct function and the speed of convergence of the algorithm. DE is suitable especially for numerical optimization of standard functions, and in the majority of cases it converges quite quickly. In the case of very complex multimodal objective functions it can sometimes get stuck in a local extreme.
5.3. SELF-ORGANIZING MIGRATING ALGORITHM SOMA
The function of the SOMA algorithm is based on geometric principles. Because this algorithm works with populations like, e.g., genetic algorithms, and the result after one evolution cycle, or migration cycle, is equivalent to that of genetic algorithms or differential evolution, it can be considered one of the evolutionary algorithms, even despite the fact that no new offspring are produced in the process, as is the case in other evolutionary algorithms. The original idea leading to the development of this algorithm is based on the imitation of the behaviour of a group of individuals that cooperate in solving their collective problems, such as the search for a source of food, etc.
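One generation of differential evolution according to steps 1 to 4 above can be sketched as follows (Python; the parameter values F = 0.8 and CR = 0.9, the population size and the sphere test function are illustrative assumptions of this example).

    # One DE generation: mutation -> discrete crossover -> greedy selection (minimization).
    import random

    def de_generation(population, objective, F=0.8, CR=0.9):
        N, n = len(population), len(population[0])
        new_population = []
        for i, x_i in enumerate(population):
            # step 1: three mutually different agents, all different from i
            r1, r2, r3 = random.sample([j for j in range(N) if j != i], 3)
            v = [population[r1][j] + F * (population[r2][j] - population[r3][j])
                 for j in range(n)]                                   # noise vector
            # step 2: trial string u by discrete crossover of v and x_i
            u = [v[j] if random.random() <= CR else x_i[j] for j in range(n)]
            # step 3: the more successful of x_i and u passes into the new population
            new_population.append(u if objective(u) <= objective(x_i) else x_i)
        return new_population

    sphere = lambda x: sum(v * v for v in x)
    pop = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(20)]
    for _ in range(100):                          # repeat step 4 over 100 generations
        pop = de_generation(pop, sphere)
    best = min(pop, key=sphere)
    print(best, sphere(best))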

33 In contrast to other evolutionary algorithms, in SOMA algorithm new solutions (agents, offspring) are not produced by the process of parent crossover, but its function is based on a cooperative search (migration) in the space of possible solutions of a given problem. Since SOMA itself doesn t copy already mentioned evolution principles, but it follows the principles coming from the cooperation of intelligent individuals migrating in the space of possible solutions as well as their biological counterparts do in the natural environment, the evolution cycle known as generation has been renamed as migrating cycle. As the main idea of SOMA algorithm is not based on the principle of evolution itself, but it is based on already mentioned principles of a pack, it s not classified as an evolutionary algorithm, but it is called a memetic algorithm. In its pure mathematical basis this fact doesn t play any important role, because the meaning of a parallel algorithm (i.e. algorithm working with a set of all solutions at the same time) is to search for new better solutions positions on the hyperplane. If evolutionary principles on the level, e.g. of genome, are used for calculations of new positions, these algorithms are called genetic etc. The key role in all evolution processes is played by the so-called mutation. Mutation procedures differ from each other in individual algorithms, nevertheless what they have in common is the fact that a randomness generated by a generator of random numbers is used in the process. In case of SOMA mutation it is called perturbation. Reason for using another new terminology is simple. When agents are moving through the space of possible solutions their movement is randomly disturbed (i.e. perturbed) and not mutated. How strong the perturbation is (how many parameters of an agent will be modified) depends on the setting of PRT parameter, which is one of the user-defined parameters for SOMA algorithm. By using this algorithm a perturbation vector (PRT vector) is generated. This vector is generated for every agent separately and it is valid only for one actual process of one active agent. PRT parameter is basically an equivalent of mutation probability parameter in genetic algorithms. Crossover is replaced in SOMA by migration of an agent on the hyperplane. During this activity every agent remembers coordinates of the positions where it has found the best solution on its journey, and this solution passes to the next migrating cycles. Principle of SOMA algorithm can be characterized as competitively-cooperating behaviour of agents solving their collective problem. The behaviour of wolf pack hunting and trying to find food can serve as a very good example. In the phase of cooperation individual wolves convey to one another what the quality of food they have just found is, and based on this fact they try to adjust their behaviour. In the phase of competition each wolf tries to win over the others it tries to find the best source of food. After this phase of competition the phase of cooperation begins again, and wolves exchange information about the fact which one has the best source of food. The others leave their sources of food and migrate (phase of competition) towards the wolf with the best source of food. During this migration they try to find even any source of food. This repeats till all wolves meet over the richest portion of food. SOMA algorithm functions on this very simplified principle as well. 
SOMA, with its robustness in finding the global extreme, can compete with algorithms such as differential evolution, and it is able to find even very deep global extremes of given objective functions.
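The following is a minimal Python sketch of the AllToOne strategy described above: in every migration cycle all agents migrate in perturbed steps towards the current leader and remember the best position found on the journey. The parameter names PRT, PathLength and Step follow the usual SOMA convention; the concrete values, the function name and the sphere test function are illustrative assumptions only.

import numpy as np

def soma_all_to_one(f, bounds, n_pop=20, prt=0.3, path_length=3.0, step=0.11, migrations=50):
    # Minimal SOMA (AllToOne) sketch for minimization.
    rng = np.random.default_rng(0)
    dim = len(bounds)
    low, high = np.array(bounds).T
    pop = rng.uniform(low, high, size=(n_pop, dim))
    cost = np.array([f(x) for x in pop])
    for _ in range(migrations):
        leader = np.argmin(cost)                        # best agent of this migration cycle
        for i in range(n_pop):
            if i == leader:
                continue
            best_x, best_f = pop[i].copy(), cost[i]
            # PRT vector: generated anew for every migrating agent
            prt_vector = (rng.random(dim) < prt).astype(float)
            for t in np.arange(step, path_length + step, step):
                # perturbed jump of agent i towards the leader
                candidate = np.clip(pop[i] + (pop[leader] - pop[i]) * t * prt_vector, low, high)
                fc = f(candidate)
                if fc < best_f:                         # the agent remembers its best position
                    best_x, best_f = candidate, fc
            pop[i], cost[i] = best_x, best_f
    best = np.argmin(cost)
    return pop[best], cost[best]

# usage: minimize the sphere function in 5 dimensions
best_x, best_f = soma_all_to_one(lambda x: np.sum(x**2), bounds=[(-5, 5)] * 5)
print(best_x, best_f)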

COMPARISON WITH GENETIC ALGORITHMS

The similarities and differences between SOMA and the other algorithms can be summarized as follows [21]:

Formation of a new population:
- GA: by two parents.
- DE: by four parents.
- SOMA: by one with all (AllToOne variant), or all with all (AllToAll variant).

Randomness:
- GA: by the mutation operator; an agent moves in the N-dimensional space.
- DE: the same as GA.
- SOMA: by the PRT vector; an agent moves in an (N-k)-dimensional subspace, which is perpendicular to the original N-dimensional space.

Selection of parents:
- GA: by means of various techniques (roulette, fitness tournament, random selection, ...).
- DE: 3 of the 4 parents are chosen purely at random; the fourth one is calculated from the first three.
- SOMA: does not select parents (the exceptions are the AllToRand version and/or the AllToAll Adaptive version).

5.5. ARTIFICIAL IMMUNE SYSTEM

Not only have other types of evolutionary computational techniques (EVT) found their inspiration in nature; there is also another approach, based on imitating the principles of the immune system of living organisms. First let us briefly describe the biological background of this mechanism. When an unknown antigen enters the human body, the immune system starts to produce antibodies. Antibodies are specific molecules that are able to recognize these unknown antigens and bind themselves to the antigens' specific parts called epitopes. Each antigen has several epitopes, which means that every type of antigen can be recognized by several types of antibodies. If the presence of an unknown antigen is detected, the immune system starts to reproduce (clone) the antibodies which have found this intruder. During this process the newly produced antibody molecules mutate, and some of them thus specialize even better on the unknown antigen. These better antibodies further initiate the production of their own clones; the rest of them are not used. Within this process the better antibodies, or more precisely their characteristics, adapt more and more to the struggle with a specific type of antigen.

On the basis of the process described above a new algorithm was designed in [Castro 02] and applied to the optimization of multimodal functions. A cell is one possible solution (the same as a string or chromosome), represented by a vector of real values x ∈ R^n. A population is a group of cells.

Affinity is the degree of similarity of two cells, expressed by the Euclidean distance of the corresponding numerical vectors. A clone is an identical duplicate of a parent cell.

The algorithm of the Artificial Immune System is the following:
1. Random initialization of a population of n_pop cells and evaluation of their fitness values.
2. Test of the terminating conditions.
3. Evaluation of the fitness vector of the entire population.
4. Generation of n_c clones of every cell in the population, through which an expanded population is created.
5. Mutation of each clone, inversely proportional to its success rate according to relationship (5.1) (successful cells are mutated less, unsuccessful ones more).
6. If the most successful mutated clone of a specific parent is better than the parent, it replaces the parent in the population.
7. If the average fitness value of the new population has significantly increased in comparison with the actual population (let us say by a value Δfit), return to point 2.
8. Determination of the mutual affinity (distance) of all the cells in the expanded population. Elimination of the cells whose affinity towards the best cells is smaller than a suitably determined small value (many redundant similar cells are eliminated by this process).
9. Random selection of β·n_pop cells and completion of (1-β)·n_pop cells from the group of the original cells into the new population, where 0 < β < 1 (we can also use some of the other selection methods mentioned in chapter 4.3).

The following relationship is valid for the mutation:

   x^(t+1) = x^t + α · N · e^(-f_n(x))                  (5.1)

where x^(t+1) is the new mutated cell of the original parent x^t, N (0 < N < 1) is a random number with a normal distribution of probability, α is a constant adapting the size of the mutation to the specific application, and f_n(x) is the normalized fitness function, for which the following is valid:

   f_n(x) = (f(x) - min f(x)) / (max f(x) - min f(x))

where f(x) is the fitness in the case of maximization.
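The following is a minimal Python sketch of the clonal-selection loop described above, for maximization of a fitness function f. The function name, the parameter values, the test function and several simplifications (a vector of normally distributed numbers is used instead of the single random number N, and the refilling step is reduced to fresh random cells) are assumptions of this sketch and are not prescribed by the text.

import numpy as np

def artificial_immune_system(f, bounds, n_pop=20, n_clones=5, alpha=1.0,
                             beta=0.8, suppress_dist=0.1, generations=100):
    # Minimal artificial immune system sketch: clone, mutate inversely to fitness,
    # replace parents by better clones, suppress similar cells, refill the population.
    rng = np.random.default_rng(0)
    dim = len(bounds)
    low, high = np.array(bounds).T
    cells = rng.uniform(low, high, size=(n_pop, dim))
    for _ in range(generations):
        fit = np.array([f(x) for x in cells])
        fn = (fit - fit.min()) / (fit.max() - fit.min() + 1e-12)   # normalized fitness f_n(x)
        for i in range(n_pop):
            for _ in range(n_clones):
                # mutation of a clone, smaller for more successful (higher f_n) cells
                clone = np.clip(cells[i] + alpha * rng.normal(size=dim) * np.exp(-fn[i]), low, high)
                fc = f(clone)
                if fc > fit[i]:                      # a better clone replaces its parent
                    cells[i], fit[i] = clone, fc
        # suppression: drop cells lying too close to better cells (redundant solutions)
        order = np.argsort(-fit)                     # best cells first (maximization)
        keep = []
        for idx in order:
            if all(np.linalg.norm(cells[idx] - cells[k]) > suppress_dist for k in keep):
                keep.append(idx)
        n_keep = min(len(keep), int(beta * n_pop))
        fresh = rng.uniform(low, high, size=(n_pop - n_keep, dim))
        cells = np.vstack([cells[keep][:n_keep], fresh])
    fit = np.array([f(x) for x in cells])
    best = np.argmax(fit)
    return cells[best], fit[best]

# usage: maximize a simple concave function in 2 dimensions
best_x, best_f = artificial_immune_system(lambda x: -np.sum((x - 1.0)**2), bounds=[(-5, 5)] * 2)
print(best_x, best_f)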

Summary of terminology

Evolutionary algorithms, evolutionary strategies, differential evolution, SOMA, artificial immune system.

Questions

Define evolutionary algorithms and compare them with genetic algorithms.
Characterize the methods of evolutionary algorithms.

6. DATA WAREHOUSES - SOURCE OF KNOWLEDGE AND INFORMATION

Time for studying: 4 hours

Objective: Student will learn facts about the principles of data warehouses, the concept of data mining, and the tasks and methods of data mining.

Lesson: explanation of the basic terminology

The fields of Business Intelligence (BI) and Data Warehousing (DW) are two of the fastest developing branches of the software industry, both abroad and on the Czech and Slovak market. This situation has been caused by important changes in the sphere of entrepreneurship in recent years. These changes are probably most cogently described in the quotation of Peter Drucker: Knowledge and information are the only meaningful resource. The traditional production factors, land, labour and capital, have not disappeared, but they have become secondary. Nowadays information and knowledge are considered the main producers of wealth.

WHY TO INVEST INTO CREATION OF DATA WAREHOUSES?

Ralph Kimball's six-point definition of a data warehouse (DW):

DW offers users access to the company's data. Access can be considered from several points of view:
- users must have access to DW from their personal computers,
- access must be viable, reliable and efficient at any time,
- access must be easily available for users and user-friendly.

DW is consistent. The consistency can also be considered from more than one point of view:
- two users asking about the same issue must be given identical answers under the same conditions,
- DW contains one definition of a specific issue (e.g. a sale) which can be found in DW and is valid for all users,
- data in DW are validated and 100% clean.

Data in DW can be filtered and combined according to all possible measures in a corporation or an organization ("slice and dice"). This requirement leads to the implementation of multidimensional modelling: indicators (metrics, variables) are monitored according to the corporate measures (dimensions). The slice and dice approach says that it is always possible to form a question in such a way that any dimension or combination of dimensions can be placed into the rows of an imaginary chart, any dimension or combination of dimensions into its columns or bars, and the dimensions which are not observed are filtered out (a short example is given further below).

A data warehouse is not only a means of processing data from databases; it also includes tools for data enquiry, analysis and presentation. The results of a DW project are not only technologies (hardware, RDBMS etc.) and data saved in a database, but also applications and tools which have access to the data and are able to present them in a user-friendly form. Applications have direct access to the DW relational database and can use data processed beforehand in multidimensional OLAP databases.

DW is a place where high-quality and validated data are published. People responsible for loading the DW database allow the use of only those data that are validated and complete; incomplete or incorrect data cannot be published.

The quality of DW is a motivation for Business Process Reengineering (BPR). DW cannot correct low-quality data; data in the DW database are organized so that errors can be easily identified, and based on what they see users can initiate changes to the processes by which data are obtained so that fewer errors occur.

From the technological point of view the data warehouse is considered one of the means used in the process of creating an information system which includes several technologies. Among them there are:
- DW database, typically a relational database,
- ETL tools (Extraction, Transformation, Loading), data pumps for transferring data into DW,
- data marts,
- unitary storage of metadata of the data warehouse's individual components (metadata repository),
- OLAP tools,
- applications for end-users (MIS),
- tools for analysis, enquiry and generation of reports.

Data mining in the context of Business Intelligence

Analytical needs in corporations are growing, to support either management's strategic decision-making process (management information system) or the operative decision-making of employees from marketing departments, call-centres or risk management departments. These needs lead to the storage of data, including their history, in a data warehouse, which is characterized by a de-normalized data scheme where the same pieces of information are stored in several places. This enables quick answers to complex analytical questions. Furthermore, while data are recorded from primary systems into the data warehouse they are cleaned and possible errors are eliminated.

Usually, every group of users has its specialized data mart in DW. These specialized data marts provide their individual groups of users with only the pieces of information they actually need. To manage these data marts, OLAP tools, reports and other techniques like data mining are used. In the typical structure of a data warehouse there is an Operational Data Store (ODS), which serves as a database of current data without history, e.g. for contact CRM, and a staging area or zero level of the data warehouse, which is purely technical and is made for effective transfer of data into the data warehouse.
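To make the slice and dice principle mentioned in Kimball's definition above more concrete, the following short Python sketch uses the pandas library on a tiny invented fact table. The table, the column names and all values are purely illustrative assumptions and are not taken from the study text.

import pandas as pd

# A tiny de-normalized fact table: one row per sale, with dimensions and one metric.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South", "North"],
    "product": ["steel", "coke",  "steel", "steel", "coke",  "steel"],
    "year":    [2014,    2014,    2014,    2015,    2015,    2015],
    "revenue": [120.0,   80.0,    95.0,    110.0,   60.0,    130.0],
})

# "Dice": place one dimension into the rows, another into the columns, aggregate the metric.
pivot = sales.pivot_table(values="revenue", index="region", columns="year", aggfunc="sum")
print(pivot)

# "Slice": dimensions which are not observed are simply filtered out.
steel_only = sales[sales["product"] == "steel"]
print(steel_only.groupby("region")["revenue"].sum())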

Structure of a data warehouse including data mining (figure)

What is data mining?

The concept of data mining, or mining in data, is defined in various ways by various authors. One of the simplest and shortest definitions is the following: data mining is the process of finding valuable information among dozens of fields in large databases. Usama M. Fayyad uses a more complex definition of this concept: data mining is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns or correlations in data. Although this definition describes the essential idea of data mining, it can be very difficult for non-experts to imagine what happens during this process, or why this new scientific domain has come into being if statistics actually does the same.

Data mining is not an application or a tool. We can find more definitions of this concept, but they mostly unanimously say that it is a nontrivial process of identifying valid, novel (or hidden), potentially useful and ultimately understandable patterns and correlations in data. In the majority of cases it concerns large volumes of data. The methods of data mining are based on statistics and on new findings from artificial intelligence and machine learning. They are nontrivial methods whose common feature is the effort to present the identified results in a form that is as accessible as possible to the end-user, e.g. clusters of data with similar characteristics, decision trees or simple rules.
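As a small illustration of one of the result forms mentioned above, the following Python sketch trains a decision tree with the scikit-learn library on the well-known iris sample data set and prints it as simple if-then rules. The choice of library, data set and parameter values is an assumption of this sketch and is not prescribed by the study text.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on a sample data set.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Present the identified pattern in an end-user friendly form: simple if-then rules.
print(export_text(tree, feature_names=list(data.feature_names)))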
