Synthesis of local thermo-physical models using genetic programming



University of South Florida Scholar Commons
Graduate Theses and Dissertations, Graduate School, 2009

Synthesis of local thermo-physical models using genetic programming
Ying Zhang, University of South Florida

Part of the American Studies Commons

Scholar Commons Citation
Zhang, Ying, "Synthesis of local thermo-physical models using genetic programming" (2009). Graduate Theses and Dissertations.

This Dissertation is brought to you for free and open access by the Graduate School at Scholar Commons. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact scholarcommons@usf.edu.

Synthesis of Local Thermo-Physical Models Using Genetic Programming

by Ying Zhang

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Chemical and Biomedical Engineering
College of Engineering
University of South Florida

Major Professor: Aydin K. Sunol, Ph.D.
John A. Llewellyn, Ph.D.
Scott W. Campbell, Ph.D.
Luis H. Garcia-Rubio, Ph.D.
Rafael Perez, Ph.D.

Date of Approval: December 11, 2008

Keywords: data mining, symbolic regression, function identification, parameter regression, statistic analysis, process simulation

Copyright 2009, Ying Zhang

ACKNOWLEDGMENTS

I wish to express my deepest gratitude to Dr. Aydin K. Sunol, for his continuous guidance and encouragement throughout my Ph.D. experience. I would also like to thank Dr. John A. Llewellyn, Dr. Scott W. Campbell, Dr. Luis H. Garcia-Rubio and Dr. Rafael Perez for being my committee members and taking time from their busy schedules. Finally, I would like to thank my family for their help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER ONE: INTRODUCTION

CHAPTER TWO: LITERATURE REVIEW
  Review of local models for phase partition coefficients
  Review of data mining techniques in the knowledge discovery process
  Elements of structural regression using genetic programming
  Terminal set, function set and initial representation
  Fitness measures for models of varying complexity
  Fitness function with no penalty for the model complexity
  Fitness function with model complexity control
  Minimum description length
  Parsimony pressure
  Fitness function using external validation
  Genetic operators
  Reproduction and crossover
  Mutation
  Selection strategy
  Parametric regression: review of objective functions and optimization methods
  Applications of intelligent system in chemical engineering

CHAPTER THREE: A HYBRID SYSTEM FOR STRUCTURAL AND PARAMETRIC OPTIMIZATION
  The system structure
  Data and data preparation
  The regression strategy
  Implementation with MATLAB based genetic search toolbox
  The population architecture
  The population size
  3.4.3 The principal component analysis and the selection of terminal & function sets
  Genetic operator
  Fitness evaluation
  Selection strategy
  Decimation strategy
  Result designation and termination criteria
  Summary
  Model evaluation
  Statistical analysis
  Graphical methods
  Goodness of fit statistics
  Cross validation and prediction
  Steady state and dynamic simulation using local models

CHAPTER FOUR: RESULTS AND DISCUSSIONS
  Local composition models and their impact
  Developed models for mixtures that form ideal or near-ideal solutions and their statistical analysis
  Performance of models for separation of ideal and near ideal mixtures
  Developed models for mixtures that form non-ideal solutions and their statistical analysis
  Performance of models for separation of non-ideal solutions
  Discussion
  Ideal gaseous and liquid mixture
  Non-ideal gaseous mixture and ideal liquid mixture
  Ideal gaseous and non-ideal liquid mixture
  Non-ideal gaseous mixture and non-ideal liquid mixture
  Linear model vs. nonlinear model
  Extrapolation

CHAPTER FIVE: CONCLUSIONS AND RECOMMENDATIONS

REFERENCES

APPENDICES
  Appendix A. User Manual for GP Package
    A.1 MATLAB files
    A.2 Architecture
    A.3 How to use the GP package
  Appendix B. Statistical Analysis
    B.1 Tests for outliers
    B.2 Evaluation of final model: cross-validation

ABOUT THE AUTHOR

LIST OF TABLES

Table 4.1 Result Summary for Propylene (1)-Propane (2)
Table 4.2 Standard Error Analysis for Generated K1 and K2 Model
Table 4.3 Propane-Propylene Distillation Column Profile
Table 4.4 Result Summary for Acetone (1)-Water (2)
Table 4.5 Standard Error Analysis for Local Model for Acetone-Water System
Table 4.6 Tower Profile at Different Runtime
Table 4.7 R² of Different Models for K1 in Group Tests
Table 4.8 R² of Extrapolation Test
Table B.1 MATLAB Code for Outlier Test
Table B.2 The Result for Outlier Test
Table B.3 MATLAB Code for Data Set Partition
Table B.4 The Result for Data Set Partition

LIST OF FIGURES

Figure 2.1 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Composition Dependent Local Models
Figure 2.2 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Equation of State
Figure 2.3 Architecture of Local Model Strategy
Figure 2.4 Knowledge Discovery Process
Figure 2.5 Example of a Regression Tree
Figure 2.6 A Simple Multilayer Neural Network Model with Two Hidden Nodes
Figure 2.7 A Flowchart for Computation through Genetic Programming
Figure 2.8 Functional Representation of (Cp = a + bT + cT²) Using a Tree Structure
Figure 2.9 An Example of Reproduction Operator
Figure 2.10 Crossover Operation for an Algebraic Equation Manipulation
Figure 2.11 An Example of Mutation Operation, Type I
Figure 2.12 An Example of Mutation Operation, Type II
Figure 3.1 A Hybrid System Structure for Structural and Parametric Optimization
Figure 3.2 The Structure for Linear-Nonlinear Regression Strategy
Figure 3.3 A Schematic Diagram of the Genetic Search Methodology
Figure 3.4 A Flowchart of Genetic Search, Steady State Population
Figure 3.5 A Flowchart of Genetic Search, Generational Population
Figure 3.6 Fitness vs. Generation Using Different Population Sizes
Figure 3.7 Fitness vs. Generation Using Different Mutation and Crossover Probabilities
Figure 3.8 The Standard Deviation of Fitness Using Different Mutation and Crossover Probabilities
Figure 3.9 Model's Accuracy vs. Generation Using Different Fitness Functions
Figure 3.10 Model's Complexity vs. Generation Using Different Fitness Functions
Figure 3.11 Model Fitness vs. Generation Using Different Tournament Sizes
Figure 3.12 The Standard Deviation of Fitness vs. Generation Using Different Tournament Sizes
Figure 3.13 Model's Accuracy vs. Generation Using Different Deletion Strategies
Figure 3.14 The Standard Deviation of Fitness vs. Generation Using Different Deletion Strategies
Figure 4.1 K1 Model vs. Experimental Data for Propylene (1)-Propane (2)
Figure 4.2 Residual Plot of K1 for Propylene (1)-Propane (2)
Figure 4.3 K2 Model vs. Experimental Data for Propylene (1)-Propane (2)
Figure 4.4 Residual Plot of K2 for Propylene (1)-Propane (2)
Figure 4.5 P-X1-Y1 at Low Pressures for Propylene (1)-Propane (2)
Figure 4.6 P-X1-Y1 at Medium to High Pressures for Propylene (1)-Propane (2)
Figure 4.7 K1 vs. Liquid Composition for Different Temperatures for Propylene (1)-Propane (2) System
Figure 4.8 K1 vs. Pressure for Different Temperatures for Propylene (1)-Propane (2)
Figure 4.9 Temperature Profile of Propylene-Propane Distillation Column
Figure 4.10 Relative Volatility Profile for Propylene-Propane Distillation Column
Figure 4.11 K1 Model vs. Experimental Data for Acetone (1)-Water (2)
Figure 4.12 Residual Plot of K1 for Acetone (1)-Water (2)
Figure 4.13 K2 Model vs. Experimental Data for Acetone (1)-Water (2)
Figure 4.14 Residual Plot of K2 for Acetone (1)-Water (2)
Figure 4.15 T-X1-Y1 at Different Temperatures for Acetone (1)-Water (2)
Figure 4.16 K1 vs. Liquid Composition at Different Pressures for Acetone (1)-Water (2)
Figure 4.17 K1 vs. Temperature at Different Pressures for Acetone (1)-Water (2)
Figure 4.18 Temperature Profile of Acetone-Water Distillation Column
Figure 4.19 Relative Volatility Profile of Acetone-Water Distillation Column
Figure 4.20 Distillation Total Flow Rate
Figure 4.21 Distillation Column Pressure
Figure A.1 Architecture of MATLAB Code

SYNTHESIS OF LOCAL THERMO-PHYSICAL MODELS USING GENETIC PROGRAMMING

Ying Zhang

ABSTRACT

Local thermodynamic models are practical alternatives to computationally expensive rigorous models that involve implicit computational procedures, and they often complement such models to accelerate computation for real-time optimization and control. Human-centered strategies for development of these models are based on approximation of theoretical models. A Genetic Programming (GP) system can extract knowledge from the given data in the form of symbolic expressions. This research describes a fully data-driven, automatic, self-evolving algorithm that builds appropriate approximating formulae for local models using genetic programming. No a priori information on the type of mixture (ideal/non-ideal etc.) or assumptions are necessary. The approach involves synthesis of models for a given set of variables and mathematical operators that may relate them. The selection of variables is automated through principal component analysis and heuristics. For each candidate model, the model parameters are optimized in the inner integrated nested loop. The trade-off between accuracy and model complexity is addressed through incorporation of the Minimum Description Length (MDL) into the fitness (objective) function.

Statistical tools including residual analysis are used to evaluate the performance of models. Adjusted R-square is used to test the model's accuracy, and the F-test is used to test whether the terms in the model are necessary. The analysis of the performance of the models generated with the data-driven approach depicts the theoretically expected range of compositional dependence of partition coefficients and the limits of ideal gas as well as ideal solution behavior. Finally, the model built by GP is integrated into a steady state and dynamic flow sheet simulator to show the benefits of using such models in simulation. The test systems were propane-propylene for ideal solutions and acetone-water for non-ideal solutions. The results show that the generated models are accurate for the whole range of data and that the performance is tunable. The generated local models can indeed be used as empirical models, going beyond elimination of the local model updating procedures to further enhance the utility of the approach for deployment in real-time applications.

CHAPTER ONE
INTRODUCTION

Approaches to modeling of chemical processes have changed significantly in the past three decades. In general, these approaches are divided into two generic categories. One is mechanistic modeling, which is mainly based on first principles and fundamental knowledge. The other is empirical modeling, which is data driven. In the latter, the model structure and its associated parameters are selected to represent the process data accurately for a given range, with the aim of simplifying the model development stage and reducing the computational load.

Data-driven modeling techniques have been popular for many decades. They are easier to develop than mechanistic models, particularly for practitioners. This is especially true when mechanistic first-principles models and their associated thermophysical properties are not adequate in representing real-world problems. Furthermore, these mechanistic models are highly nonlinear and complex, which makes them difficult to identify [Ramirez 1989] and to implement, particularly in real-time applications.

Currently, most data-driven modeling methods fall under the headings of statistical methods and artificial neural networks [Pöyhönen, 1996]. Neural networks usually provide models that are accurate in representing the data, but they don't provide any insight into the represented phenomena. Usually, neural networks are black boxes, and one cannot abstract the underlying physical relationships between input and output data. It is often desirable to gain some insight into the underlying structures, as well as to make accurate numeric predictions. Genetic Programming (GP) based approaches are known to produce input-output models with relatively simple and transparent structures, and the associated procedures are referred to as symbolic regression.

Genetic Programming allows synthesis of data-driven models when model elements are represented as a tree structure. This tree structure is of variable length and consists of nodes. The terminal nodes can be input variables, parameters or constants, while the non-terminal nodes are standard library functions, like addition, subtraction, multiplication and division. Each tree structure may describe an equation.

Genetic programming works by emulating natural evolution to generate an optimum model structure that maximizes some fitness function. Model structures evolve through the action of operators known as reproduction, crossover and mutation. Crossover involves interchange of the branches from two parent structures. Mutation is random creation of a completely new branch. At each generation, a population of model structures undergoes crossover, mutation and selection, and then a fitness function is evaluated. These operators improve the general fitness of the population. Based on fitness, the next generation is selected from the pool of old and new structures. The process repeats itself until some convergence criterion is satisfied and a model is generated.

One primary classification used for property and process models in Chemical Engineering is based on algebraic versus differential equation models [Franks 1967]. The mathematical models comprise either a set of algebraic equations for steady-state operation, a set of ordinary differential equations (ODE) coupled with algebraic equations for dynamic (time-dependent) models, or partial differential equations (PDE) for distributed models. The majority of algebraic mathematical models for physical or engineered systems can be classified into one of the following three types [Englezos 2001]:

Type I: A model with a single dependent variable and a single independent variable. For example, the heat capacity model for an ideal gas is a function of temperature.

Type II: A model with a dependent variable and several independent variables. An example is a pressure-explicit equation of state (EOS), which enables the calculation of fluid phase equilibrium and of thermo-physical properties such as enthalpy, entropy, and density that are necessary in the design of chemical processes. Mathematically, a pressure-explicit EOS expresses the relationship among pressure, volume, temperature, and composition for a fluid mixture.

Type III: A model with multiple dependent variables and several independent variables. A typical group of applications is modeling of reaction kinetics, where a possible mechanism is depicted as multiple reactions that are coupled through the concentrations of species.

The objective of this dissertation is to develop a methodology that uses genetic operations to find a symbolic relationship between a single dependent variable and multiple independent variables, i.e., Type II. The approach was demonstrated earlier for Type I problems by Zhang [2004]. Unlike conventional regression, which seeks the best set of parameters for a pre-specified model, the structure and hence the complexity of the model or equation is not specified. The goal is to seek a mathematical expression, in symbolic form, which fits or approximates a given sample of data using genetic programming (GP). The approach is called Symbolic Regression.

A nested two-tier approach is proposed in this research, in which the parameter regression method is embedded within GP. GP is employed to optimize the structure of a model, while classical numerical regression is employed to optimize its parameters for each proposed structure. The model structure and its parameters are unknown, and are determined at each step of the algorithm. The model's adequacy is tested through post analysis.

The approach is tested on a practical and significant problem: development of local and/or empirical partition coefficient models for vapor-liquid separation. For accurate chemical process design and effective operation, a correct estimate of physical and thermodynamic properties is a prerequisite. The estimation of these properties for pure components and mixtures through first-principles but complex implicit models is computationally costly. The computational time is critical, particularly in real-time applications. The phase equilibrium calculations are the most computationally intensive of these properties due to the implicit nature of the procedures with more complex property models, especially when used with rigorous separation models [Leesley 1977].

Local thermodynamic models are explicit functions that approximate more rigorous models that involve implicit computational procedures in equilibrium calculations. Computations with these functions are fast and at times non-iterative, but are only valid in a limited region where the functions are accurate. Therefore, local models need to be updated as the simulation moves into new regions of the state space. Since the late seventies, many functional forms with differing independent variable sets have been suggested for these models, and some have been implemented within flow sheet simulator packages [Perregaard 1992, Storen 1994, and Storen 1997].

This introductory chapter is followed by Chapter Two, where local models, data mining applications and technologies, evolutionary algorithms and their applications in chemical engineering, and optimization methods and their objective functions are reviewed. Chapter Three describes the proposed system structure, guidelines for determination of GP controlling parameters, and the details of implementing the approach. Results and discussion are given in Chapter Four. Finally, conclusions and recommendations are presented in Chapter Five.
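The nested two-tier idea introduced in this chapter can be illustrated with a small Python sketch: an outer search over model structures and, for each structure, an inner parameter regression. To keep it short, the outer tier below is an exhaustive loop over four hand-written one-parameter structures rather than a genetic search, and the inner tier is the closed-form least squares estimate for a model of the form y = a g(T). The candidate set, the function names and the synthetic data are hypothetical.

```python
import math

# Outer tier: candidate structures y = a * g(T) (hypothetical, hand-written set).
CANDIDATE_STRUCTURES = {
    "a*T": lambda T: T,
    "a*T^2": lambda T: T * T,
    "a*ln(T)": lambda T: math.log(T),
    "a/T": lambda T: 1.0 / T,
}

def fit_parameter(g, data):
    """Inner tier: least squares estimate of a in y = a * g(T)."""
    numerator = sum(g(T) * y for T, y in data)
    denominator = sum(g(T) ** 2 for T, _ in data)
    return numerator / denominator

def sum_squared_error(g, a, data):
    return sum((y - a * g(T)) ** 2 for T, y in data)

def search(data):
    """Fit every candidate structure and return (error, name, a) of the best one."""
    results = []
    for name, g in CANDIDATE_STRUCTURES.items():
        a = fit_parameter(g, data)
        results.append((sum_squared_error(g, a, data), name, a))
    return min(results)

# Synthetic data generated from y = 3 / T.
data = [(T, 3.0 / T) for T in (300.0, 320.0, 340.0, 360.0, 380.0, 400.0)]
best_error, best_name, best_a = search(data)
```

In a genetic search the candidate structures would be generated and recombined by the operators instead of being listed in advance, but the division of labor between the two tiers is the same.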

CHAPTER TWO
LITERATURE REVIEW

This chapter reviews local models for the vapor-liquid partition coefficient (K value), data mining tasks and techniques, evolutionary algorithms and optimization methods. In the first section, the development of local models is summarized. The review of data mining technologies is given in the second section. The third section describes the development of evolutionary algorithms and compares different algorithms, with more emphasis given to genetic programming. A brief summary of applications of intelligent systems in chemical engineering is also given at the end of this section. In the fourth section, some popular optimization methods and pertinent objective functions (criteria) are reviewed.

2.1 Review of local models for phase partition coefficients

A correct estimate of physical and thermodynamic properties is a prerequisite for accurate chemical process design and operation. The calculation of those properties for pure components and mixtures accounts for the major share of the computer time. Local thermodynamic models are practical alternatives to computationally expensive, more rigorous macroscopic or molecular models that involve implicit computational procedures, and they often complement such models to accelerate computation for run-time optimization and control. Since the vapor-liquid equilibrium constant K is among the most computationally expensive properties [Leesley 1977], the research efforts on developing local models focus on it.

Since the late seventies, several research groups have developed local thermodynamic models and accompanying procedures to be implemented within flow sheet simulator packages. The objective is to replace, or assist, more rigorous thermodynamic models with local alternatives to reduce the computer time while maintaining the thermodynamic accuracy at an acceptable level. Local thermodynamic models have been used to accelerate steady state calculation [Leesley et al. 1977, Chimowitz et al. 1983, Perregaard 1993], dynamic simulation [Chimowitz et al. 1984, Perregaard 1993], and dynamic optimization [Storen 1997].

The more rigorous thermodynamic models are nonlinear equation sets which involve iterative calculations for the vapor-liquid equilibrium constant (K) model. The local models are in explicit form, and linear with respect to their parameters. Their calculation procedure is shown in Figure 2.1, while the flowchart of an isothermal, isobaric flash calculation using an equation of state is shown in Figure 2.2. As can be seen, the explicit local models are much easier and faster to evaluate, but they are only valid locally. The local models must be updated if the simulation proceeds out of the region where the local model is valid. The implementation of the idea involves three major components: local model formulation, error monitor and parameter update, as shown in Figure 2.3.

Figure 2.1 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Composition Dependent Local Models

1. Specify T, P (of equilibrium), and feed mole fractions $z_i$.
2. Make an initial estimate of x and y.
3. Calculate $K_i(x, y, T, P)$.
4. Calculate L by solving
\[ f(L) = \sum_{i=1}^{n} \frac{(1 - K_i)\, z_i}{L + K_i (1 - L)} = 0 \]
5. Calculate
\[ x_i = \frac{z_i}{L + K_i (1 - L)}, \qquad y_i = K_i x_i \qquad (i = 1, 2, \ldots, n) \]
6. Compare estimated and calculated values of x and y. If converged, stop. If not converged, form a new estimate of x and y (if not using direct substitution) and return to step 3.

Figure 2.2 Flowchart for an Isothermal Flash Calculation Algorithm that Uses Equation of State

1. Specify T, P (of equilibrium), and feed mole fractions $z_i$.
2. Guess a set of $K_i$ (i = 1, 2, ..., n).
3. Calculate L by solving
\[ f(L) = \sum_{i=1}^{n} \frac{(1 - K_i)\, z_i}{L + K_i (1 - L)} = 0 \]
4. Calculate
\[ x_i = \frac{z_i}{L + K_i (1 - L)}, \qquad y_i = K_i x_i \qquad (i = 1, 2, \ldots, n) \]
5. Calculate $Z^l$ using x, T and P; then $f_i^l(T, P, x)$.
6. Calculate $Z^v$ using y, T and P; then $f_i^v(T, P, y)$.
7. Is $f_i^l(T, P, x) = f_i^v(T, P, y)$? If no, update $K_i^{new} = K_i^{old}\, f_i^l / f_i^v$ and return to step 3. If yes, the solution for L, $x_i$ and $y_i$ (i = 1, 2, ..., n) has been found.
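The loop of Figure 2.1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure used in the simulators cited here: f(L) is solved by bisection on [0, 1], the outer loop uses direct substitution, and `k_model` is a hypothetical callable standing in for any explicit local K model. The example uses constant K values purely to exercise the code.

```python
def flash_residual(L, z, K):
    """f(L) = sum_i (1 - K_i) z_i / [L + K_i (1 - L)], with L the liquid fraction."""
    return sum((1.0 - Ki) * zi / (L + Ki * (1.0 - L)) for zi, Ki in zip(z, K))

def solve_liquid_fraction(z, K, tol=1e-12, max_iter=200):
    """Solve f(L) = 0 on [0, 1] by bisection (f decreases monotonically in L)."""
    lo, hi = 0.0, 1.0
    f_lo = flash_residual(lo, z, K)
    f_hi = flash_residual(hi, z, K)
    if f_lo * f_hi > 0.0:
        raise ValueError("no two-phase solution for these K values")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = flash_residual(mid, z, K)
        if abs(f_mid) < tol or hi - lo < tol:
            return mid
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

def isothermal_flash(z, k_model, T, P, tol=1e-10, max_iter=100):
    """Direct substitution on x and y with an explicit K model k_model(x, y, T, P)."""
    x, y = list(z), list(z)                      # initial estimate of x and y
    for _ in range(max_iter):
        K = k_model(x, y, T, P)
        L = solve_liquid_fraction(z, K)
        x_new = [zi / (L + Ki * (1.0 - L)) for zi, Ki in zip(z, K)]
        y_new = [Ki * xi for Ki, xi in zip(K, x_new)]
        change = max(abs(a - b) for a, b in zip(x_new + y_new, x + y))
        x, y = x_new, y_new
        if change < tol:
            return L, x, y
    raise RuntimeError("flash calculation did not converge")

# Hypothetical binary example with composition-independent K values.
def constant_k_model(x, y, T, P):
    return [2.0, 0.5]

L, x, y = isothermal_flash([0.5, 0.5], constant_k_model, T=300.0, P=1.0)
```

With an equation of state in place of the explicit model, as in Figure 2.2, each pass would additionally require compressibility factor and fugacity evaluations for both phases, which is where the extra cost arises.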

Figure 2.3 Architecture of Local Model Strategy (blocks: simulator (process model), explicit TP model, rigorous TP model, error monitor, parameter update)

The first component of local model based system development is to formulate the approximate local function. Leesley [1977] developed several local models for ideal solutions, which did not include composition dependence. He derived the local K model from the complete form:

\[ K_i = \frac{y_i}{x_i} = \frac{\gamma_i \, \Phi_i^{0v} \, P_i^s \, \exp\left( \int_{P_i^s}^{P} \frac{V_i^L}{RT} \, dP \right)}{\hat{\Phi}_i^V \, P} \tag{2.1} \]

After simplification, through the ideal solution assumption and avoidance of complex functional forms, an approximation to Eq. (2.1) can be developed for low pressure:

\[ \ln K_i = A_{1,i} \ln P_i^s(T) + A_{2,i} \ln P \tag{2.2} \]

Eq. (2.2) reproduces the temperature dependence of K-values fairly well over the range of … °C. However, the relation is too approximate to be useful above pressures of 2-3 bar. Thus, a third adjustable coefficient has been introduced in the approximation formula for high pressure applications:

\[ \ln K_i = \frac{A_{1,i}}{T} + A_{2,i} - A_{3,i} \ln P \tag{2.3} \]

Chimowitz [1983] extended the local models to non-ideal solutions for multicomponent vapor-liquid systems. One of the essential ideas has been to treat multicomponent mixtures as pseudo-binary solutions. The functional form used to model the K values, which is composition-dependent for each pseudo-binary, has also been derived from basic thermodynamic considerations. Chimowitz presented the following local model for non-ideal solutions:

\[ \ln K_i = \frac{A_i}{RT} (1 - x_i)^2 + \ln P_i^s - \ln P \tag{2.4} \]

Ledent [1994] used a sequential least squares procedure to build approximating formulae from a general model, Eq. (2.5), that contains all the terms necessary to represent any particular mixture. Eq. (2.5) expresses $\ln K_i$ as a linear combination of a constant, terms in $\ln P_i^s(T)$, $1/T$ and $\ln P$, and two sums over the n components that carry the composition dependence through $x_i$ and $(1 - x_i)$. The problem with this formula is that it has too many parameters (5 + 2n) to be efficient. To eliminate the unnecessary terms, Eq. (2.5) is rewritten in the form of:

\[ A = (Q^T Q)^{-1} Q^T F \tag{2.6} \]

Eq. (2.6) is the least squares solution of Eq. (2.5), where A is the vector of local model parameters, F is the vector of ln(K) obtained from experimental data, and Q is the matrix of terms. If one lets $C = (Q^T Q)^{-1}$, then

\[ \mathrm{Corr}_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}} \tag{2.7} \]

If this correlation is found to be higher than a specified tolerance, one of the two parameters is eliminated, together with the corresponding line and column in matrix C. A stepwise regression strategy is applied. For each parameter introduced, the parameters are re-computed and the residuals are examined. If they are satisfactory, the local model is accepted. If not, the parameter is eliminated before introducing the next one. When all parameters have been examined, the ones that gave the lowest residuals are accepted.

The second component is the error monitor, which estimates the range of validity of the local models for a set of parameters and identifies when to update the local model. Leesley and Heyen [1977] fixed upper and lower values of the two independent variables, T and P. The bounds defined an interval that included the two data points used in calculating the parameters. Hillestad et al. [1989] and Storen [1994] developed different error models for predicting the deviation between local and rigorous thermodynamic property models.

The third component is parameter estimation for updating the models as the range of validity of the model changes. Macchietto [1986] and Hillestad et al. [1989] applied recursive least-squares methods. The objective here is to preserve information from past data in the covariance matrix for the parameters. When a new data point is introduced, the covariance matrix can be updated and new values for the parameters can be obtained.
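As a worked illustration of Eqs. (2.6) and (2.7), the sketch below computes the least squares parameters and the parameter correlation matrix in plain Python. It is an assumption-laden example rather than Ledent's procedure: the three model terms (a constant, 1000/T and ln P), the "true" parameter values and the synthetic T-P grid are invented for the demonstration, and the matrix helpers are written out by hand only to keep the block self-contained.

```python
import math

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(M)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_local_model(Q, F):
    """Return A = (Q^T Q)^-1 Q^T F and Corr_ij = C_ij / sqrt(C_ii C_jj)."""
    Qt = transpose(Q)
    C = inverse(matmul(Qt, Q))
    A = [row[0] for row in matmul(C, matmul(Qt, [[f] for f in F]))]
    n = len(C)
    corr = [[C[i][j] / math.sqrt(C[i][i] * C[j][j]) for j in range(n)]
            for i in range(n)]
    return A, corr

# Synthetic ln K data from assumed parameters for the terms [1, 1000/T, ln P].
true_A = [2.0, -1.5, -1.0]
grid = [(T, P) for T in (300.0, 325.0, 350.0, 375.0, 400.0) for P in (1.0, 2.0, 5.0)]
Q = [[1.0, 1000.0 / T, math.log(P)] for T, P in grid]
F = [sum(a * q for a, q in zip(true_A, row)) for row in Q]
A, corr = fit_local_model(Q, F)
```

An off-diagonal entry of `corr` close to 1 in magnitude flags a pair of terms that carry nearly the same information over the data range, which is the condition the elimination step tests for.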

Storen [1994] used a simplified scheme with correction factors. Ledent [1994] presented a sequential least squares procedure.

In summary, two major approaches were developed to synthesize local thermodynamic models. One method is to derive a relationship based on thermodynamic insight. Assumptions are made to simplify the relationship [Leesley 1977, Chimowitz 1983]. Each formula is suitable for a particular type of solution (ideal/non-ideal etc.). The final structure of the formula mostly includes one constant term, one term that accounts for the temperature influence, and one term that accounts for the pressure influence. In the case of non-ideal solutions, one or more terms may be added to account for the composition influence. The other approach formulates the empirical model on a statistical basis, i.e., it provides a general form of local model, which includes all the terms described above and their combinations, and then eliminates the redundant terms by examining the correlation for every pair of parameters [Ledent 1994].

Both approaches are human-centered strategies, and they share a common procedure for developing local models. An initial function structure is proposed first, and then the function structure is simplified and reduced by applying different strategies. The human-centered approaches may impose limitations on the final structure of local models, either because inappropriate assumptions made during simplification lead to an over-simplified structure, or because the proposed initial structure describes the studied system insufficiently.

The form of the local model is important because it is closely related to the correlation capabilities of the local model. It is also very important that the local model ensures a fast, robust and consistent evaluation of the parameters. For this study, we are interested in a fully automatic algorithm that can develop formulae that are sufficient and flexible for all solution mixture types and functional forms. This can be obtained with symbolic regression through genetic programming.

2.2 Review of data mining techniques in the knowledge discovery process

In this study, genetic programming is used as a data mining tool for knowledge discovery in data. Knowledge discovery in databases (KDD) is the nontrivial process of searching for valid, novel, potentially useful and ultimately understandable patterns or models in data. It involves a number of steps [Thuraisingham 1999]. For the sake of simplicity, these steps can be grouped into three major stages: data pre-processing, data mining, and post-processing. The simplified flowchart is shown in Figure 2.4.

Data mining, here, refers to a particular step in the overall knowledge discovery process. As the core stage of the KDD process, it focuses on applying discovery algorithms to find understandable and useful relationships in observed data. The data mining tasks can be divided into three major categories [Hand et al. 2001]: model building, discovering patterns and rules, and retrieval by content.

The tasks of model building can be categorized further based on objectives. The first is descriptive modeling. The goal of a descriptive model is to describe all of the data. Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the n-dimensional space into groups (cluster analysis), and models describing the relationship between variables (dependency modeling). The second is predictive modeling. The aim of predictive modeling is to build a model that will allow the value of one variable to be predicted from the known values of other variables. The key distinction between prediction and description is that prediction has, as its objective, one or more specifically targeted variables, while in descriptive problems no single variable is central to the model. Classification and regression are two of the most popular applications of predictive modeling. In classification, the variable being predicted is categorical, while in regression the variable is quantitative.

From a data mining viewpoint, this study belongs to the category of predictive modeling, which involves both model structure and parameter regression to build a local model for the vapor-liquid equilibrium coefficient K.

Figure 2.4 Knowledge Discovery Process
Stages shown: raw data and objectives; data preprocessing; data mining; examination of results; model.
Data preprocessing: data integration, data cleaning, feature selection. To clean out: incomplete/imprecise data, noisy data, missing attribute values, redundant or insignificant data.
Data mining tasks: descriptive modeling, predictive modeling, discovering patterns and rules, retrieval by content.
Data mining techniques: statistics/regression/optimization, evolutionary computing, neural networks, regression tree.
Examine results: model testing and statistical analysis.

There are many different data mining techniques. In this section, only a few techniques that are applicable to predictive modeling are summarized and compared. These include linear regression, regression trees, neural networks, genetic algorithms and genetic programming.

Since the structure of a linear model is simple and easy to interpret, and the estimation of its parameters is straightforward, linear regression holds a special place among data-driven analysis methods. A linear regression model can be represented as:

\[ y = a_0 + \sum_{i=1}^{n} a_i X_i \tag{2.8} \]

where the $a_i$ are parameters that need to be estimated by fitting the model to the given data set. $X_i$ can simply be the original predictor variables $x_i$ or, in a more generalized form, $f(x_i)$, i.e., transformations of the original x variables. $f(x_i)$ could be a smooth function, such as the logarithm or square root, or cross-product terms of the $x_i$ for polynomial models, which allow interaction among the $x_i$ in the model.

Parameter estimation for a linear regression model is straightforward through least squares fitting. However, selecting a proper model structure to fit the data is a challenge. This is because the selected model is generally empirical, rather than derived from first principles. The model may not include all of the predictor variables, or certain functions of the predictor variables, that are needed for correct prediction.

A regression tree (RT) can be viewed as a variant of a decision tree. It is designed for approximating real-valued functions rather than for classification, which is the task of a traditional decision tree. A regression tree has the representation shown in Figure 2.5.

[Figure 2.5 Example of a Regression Tree: the root node tests x_1; the branch x_1 <= 3 ends in the leaf y = 10; the branch x_1 > 3 leads to a node that tests x_2, with leaf y = 2 for x_2 <= 1 and leaf y = 5 for x_2 > 1.]

A regression tree is built through a process known as binary recursive partitioning. This is an iterative process that splits the data into partitions and then splits each partition further along each of the branches. In the structure of a regression tree, each intermediate node is a decision node that contains a test on the value of one predictor variable. The terminal nodes of the tree contain the predicted values of the output variable. The objective function for building an optimum tree structure, i.e., the function to be minimized, is the mean absolute distance.

The process of regression tree induction usually has two phases: building a tree structure that covers the training data, and pruning the tree to the best size using a validation data set. In the training process, the split that minimizes the mean absolute distance is selected at each node. Partitioning continues until a node covers a prespecified minimum number of training data, or until the mean absolute distance within a node is zero. "Pruning" involves chopping off nodes from the bottom up so that there are fewer and fewer branches in the tree; the regression tree is pruned in this way to avoid over-fitting.

In terms of performance, a regression tree is extremely effective in finding the key attributes in high-dimensional applications. In most applications, these key features are only a small subset of the original feature set. On the negative side, regression trees cannot compactly represent many simple functions, for example linear functions. A second weakness is that the regression tree model is discrete, yet it predicts a continuous variable. For function approximation, the expectation is a smooth continuous function, but a decision tree provides discrete regions that are discontinuous at the boundaries. As for its explanatory capability, a regression tree cannot describe the relationship between the output variable and the predictor variables in the form of a function.

[Figure 2.6 A Simple Multilayer Neural Network Model with Two Hidden Nodes: the inputs x_1, x_2 and x_3 are connected through the weights w_1 to w_6 to the hidden nodes h_1 and h_2, which are connected through the weights v_1 and v_2 to the output y.]
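Returning to the regression tree of Figure 2.5, the piecewise-constant and discontinuous nature of its predictions can be seen in a short sketch. The dictionary layout and the name `predict` are assumptions chosen for illustration; only the split thresholds and leaf values are taken from the figure.

```python
# Figure 2.5 as a nested structure: decision nodes are dicts, leaves are numbers.
TREE = {
    "var": "x1", "threshold": 3,
    "left": 10,                                   # x1 <= 3
    "right": {                                    # x1 > 3
        "var": "x2", "threshold": 1,
        "left": 2,                                # x2 <= 1
        "right": 5,                               # x2 > 1
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, testing one predictor at each decision node."""
    while isinstance(node, dict):
        branch = "left" if sample[node["var"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

# Two samples on either side of the x1 = 3 boundary give very different outputs.
at_boundary = predict(TREE, {"x1": 3.0, "x2": 0.0})
just_past = predict(TREE, {"x1": 3.001, "x2": 0.0})
upper_region = predict(TREE, {"x1": 7.0, "x2": 4.0})
```

An arbitrarily small change in x_1 across the threshold moves the prediction from one leaf value to another, which is the discontinuity at region boundaries described above.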

Neural networks have been found to be useful because of their learning and generalization abilities. The model structure represented by a neural network consists of multiple layers of nonlinear transformations of weighted sums of the input variables. In a single-hidden-layer network as shown in Figure 2.6, w and v are weight factors, and h is a nonlinear transformation of the weighted sum of the input variables x. The output variable y is a weighted sum of the h. Therefore, in general, the output variable y is a nonlinear function of the input variables x, and a neural network can be used as a nonlinear model for regression. If there is more than one hidden layer, the outputs from one layer, which are the transformed linear combinations of the nodes in the previous layer, serve as inputs to the next layer. In this next layer, the inputs are combined in exactly the same way, i.e., each node forms a weighted sum that is then nonlinearly transformed.

The number of layers and the number of nodes per layer are important decisions. There is no limit to the number of layers that can be used, though it can be proven that a single hidden layer (with enough nodes in that layer) is sufficient to model any continuous function [Hand 2001]. Once a network has been structured for a particular application, it is ready to be trained. The weights w and v are the parameters of this model and must be determined from the data in the training process. The fact that a neural network is highly parameterized makes it very flexible, so that it can accurately model relatively small irregularities in functions. On the other hand, such flexibility means that there is a serious danger of over-fitting. In recent years, strategies have been developed for overcoming this problem. Because of the multiple layers of nonlinear transformations of weighted sums, the relationship between the output variable y and the input variables x is hard to present as a single explicit mathematical model, so a neural network is usually used as a black box for predictive modeling.

Evolutionary algorithms (EAs) provide an effective avenue for structural and parametric regression. EAs were originally divided into three major categories, namely evolutionary programming (EP) [Fogel et al. 1966], evolution strategies [Rechenberg 1973] and genetic algorithms (GAs) [Holland 1975]. In the 1990s, a new branch called genetic programming (GP), introduced by John Koza [Koza 1992, 1994], was added to the group. GP is an extension of John Holland's GA in which the genetic population consists of models of varying complexities and structures. A GA uses a binary string to represent possible solutions to a problem, whereas GP uses a tree structure as its knowledge representation. Both GA and GP guide the search by using genetic operators and the principle of survival of the fittest.

The major difference between GA and GP is the coding used to represent possible solutions to a problem. In a GA, the solution is presented in the form of a fixed-length binary string, and its output is a quantity. The aim of such a coding is to allow the possible solutions to be manipulated with the genetic operators in the evolutionary process. Sometimes it is a challenge to encode the possible solutions as binary strings. GP uses tree structures of variable size, which allows the solutions to be manipulated in their current form. Therefore, GP can be used as a tool for symbolic regression, i.e., structural regression. The details of GP will be explained in the next section.

2.3 Elements of structural regression using genetic programming

The objective of this research is to find the approximate function for K, which involves both parametric and structural regression. As mentioned earlier, genetic programming (GP) is an extension of the genetic algorithm in which the genetic population consists of possible solutions (that is, compositions of primitive functions and terminals). Koza [1992] demonstrated the surprising result that genetic programming is capable of symbolic regression. To accomplish this, genetic programming starts with a pool of randomly generated mathematical models and genetically breeds the population using the Darwinian principle of survival of the fittest and an analog of the naturally occurring genetic crossover (sexual recombination) operation. In other words, genetic programming provides a way to search the space of possible model structures to find a solution that fits, or approximately fits, a given data set.

Genetic programming is a domain-independent method that genetically breeds populations of models to fit the given data set by executing the following three steps, which are also shown in Figure 2.7:

1. Generate an initial population of random individuals (mathematical models) composed of the primitive functions and terminals of the problem.

2. Iteratively perform the following intermediate steps until the termination criterion has been satisfied:

   o Execute each individual in the population and assign it a fitness value according to how well it solves the problem.

   o Create a new population of individuals by applying the following three primary operations. The operations are applied to individual(s) in the population selected with a probability based on fitness (i.e., the fitter the individual, the more likely it is to be selected).

     Reproduction: Copy an existing individual to the new population.

     Crossover: Create two new offspring individuals for the new population by genetically recombining randomly chosen parts of two existing individuals. The genetic crossover (sexual recombination) operation (described below) operates on two parental individuals and produces two offspring individuals using parts of each parent.

     Mutation: Randomly alter an existing individual to produce one offspring individual.

3. The single best individual in the population produced during the run is designated as the result of the run of genetic programming. This result may be the solution (or an approximate solution) fitted to the given data set.

The components of GP are described in the following subsections, which cover the terminal set, function set, fitness function, genetic operators and selection strategies.
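The three steps above can be sketched as a generic generational loop. This is a skeleton under stated assumptions, not the implementation used in this work: the operator probabilities, the fixed generation budget used as the termination criterion, and all function names are hypothetical, and the toy individuals are plain numbers standing in for model trees so that the sketch stays short.

```python
import random

def evolve(population, fitness, crossover, mutate, generations=30,
           p_crossover=0.8, p_mutation=0.1, seed=0):
    """Evaluate, select by fitness, apply one operator, repeat per generation."""
    rng = random.Random(seed)
    size = len(population)
    best = max(population, key=fitness)
    for _ in range(generations):              # termination: fixed budget (assumed)
        weights = [fitness(ind) for ind in population]   # must be positive
        new_population = []
        while len(new_population) < size:
            r = rng.random()
            if r < p_crossover and size - len(new_population) >= 2:
                # Crossover: two selected parents yield two offspring.
                mother, father = rng.choices(population, weights=weights, k=2)
                new_population.extend(crossover(mother, father, rng))
            elif p_crossover <= r < p_crossover + p_mutation:
                # Mutation: one selected parent yields one altered offspring.
                (parent,) = rng.choices(population, weights=weights, k=1)
                new_population.append(mutate(parent, rng))
            else:
                # Reproduction: copy one selected individual unchanged
                # (also used when only one slot is left for a crossover).
                (parent,) = rng.choices(population, weights=weights, k=1)
                new_population.append(parent)
        population = new_population
        best = max(population + [best], key=fitness)     # best seen during the run
    return best, population

# Toy stand-ins (assumptions): individuals are numbers, the optimum is x = 3.
def toy_fitness(x):
    return 1.0 / (1.0 + (x - 3.0) ** 2)

def toy_crossover(a, b, rng):
    t = rng.random()
    return [t * a + (1 - t) * b, (1 - t) * a + t * b]

def toy_mutate(x, rng):
    return x + rng.gauss(0.0, 0.5)

start = [v + 0.5 for v in range(-10, 10)]
best, final_population = evolve(start, toy_fitness, toy_crossover, toy_mutate)
```

In an actual GP run the three callables would operate on model trees, and the fitness would be computed from the fit of each model to the data.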

[Figure 2.7 A Flowchart for Computation through Genetic Programming: set Gen = 0 and create the initial population; evaluate the fitness of each individual in the population; if the termination criterion is met, done; otherwise set N = 0 and probabilistically select a genetic operation (reproduction, N = N + 1; crossover, N = N + 2; mutation, N = N + 1) until N equals the population size; then set Gen = Gen + 1 and evaluate again.]

Terminal set, function set and initial representation

In genetic programming, any explicit mathematical equation can be represented by a tree in which the intermediate nodes are mathematical operators (functions) and the terminal nodes (leaves) are input variables and parameters.

As shown in Figure 2.8, the equation for the ideal heat capacity, $C_p = a + bT + cT^2$, can be represented as a tree.

[Figure 2.8 Functional Representation of $C_p = a + bT + cT^2$ Using a Tree Structure: the root is +; its first child is a + node with children a and (b * T); its second child is a * node with children c and (T ^ 2).]

In this graphical depiction, the function set consists of the intermediate nodes of the tree, which are labeled with mathematical operators such as +, * and ^. The terminal set consists of the terminal nodes (leaves) of the tree, which are labeled with the input variable T, the parameters a, b and c, and the constant 2.

The terminal and function sets are important components of genetic programming. They contain the primitive elements from which the mathematical model is composed. The sufficiency property requires that the set of terminals and the set of primitive functions be capable of expressing a solution to the problem.
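The tree of Figure 2.8 can be written down and evaluated with a few lines of code. The nested-tuple encoding, the `evaluate` function and the numerical values of a, b, c and T are assumptions for illustration only.

```python
import operator

# Function set: the intermediate nodes that appear in Figure 2.8.
FUNCTIONS = {"+": operator.add, "*": operator.mul, "^": operator.pow}

def evaluate(node, env):
    """Recursively evaluate a tree: tuples are function nodes, the rest are terminals."""
    if isinstance(node, tuple):
        op, left, right = node
        return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(node, str):
        return env[node]        # input variable or parameter
    return node                 # numeric constant

# C_p = a + b*T + c*T^2, structured as in Figure 2.8.
cp_tree = ("+", ("+", "a", ("*", "b", "T")), ("*", "c", ("^", "T", 2)))

# Hypothetical parameter values and temperature.
env = {"a": 1.0, "b": 2.0, "c": 0.5, "T": 4.0}
cp = evaluate(cp_tree, env)
```

Because every function node takes two numeric arguments and returns a number, any subtree can stand wherever a terminal can, which is what later allows subtrees to be exchanged freely.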

2.3.2 Fitness measures for models of varying complexity

The most difficult and most important concept of genetic programming is the fitness function. The fitness function determines how well a generated model fits the data. Fitness is the driving force of genetic programming. In genetic programming, each individual model in a population is assigned a fitness value.

Fitness function with no penalty for the model complexity

The basic fitness function is a function of the difference between the value predicted by the model and the data. Widely used basic fitness functions include the raw fitness, the adjusted fitness and the normalized fitness.

The raw fitness is the sum of squared errors. In particular, the raw fitness $r(i, t)$ of an individual model $i$ in the population of size $M$ at any generation $t$ is

$$r(i, t) = \sum_{j=1}^{N_e} \left[ S(i, j) - C(j) \right]^2 \qquad (2.9)$$

where $S(i, j)$ is the value returned by individual model $i$ for data case $j$ (of $N_e$ data cases) and $C(j)$ is the data value for data case $j$. The closer this sum of squared errors is to zero, the better the model.

Another popular raw fitness is the absolute error. Since the squared error gives greater weight to extreme differences between the predicted value and the data than the absolute error does, the quality of the model is, in some cases, perhaps more appropriately reflected by the absolute error.

Each raw fitness value can be adjusted (scaled) to produce an adjusted fitness measure $a(i, t)$. The adjusted fitness value is

$$a(i, t) = \frac{1}{1 + r(i, t)} \qquad (2.10)$$

where $r(i, t)$ is the raw fitness of individual model $i$ at generation $t$. Unlike the raw fitness, the adjusted fitness is larger for better individuals in the population. Moreover, the adjusted fitness lies between 0 and 1.

Each such adjusted fitness value $a(i, t)$ is then normalized. The normalized fitness value $n(i, t)$ is

$$n(i, t) = \frac{a(i, t)}{\sum_{j=1}^{M} a(j, t)} \qquad (2.11)$$

The normalized fitness not only ranges between 0 and 1 and is larger for better individuals in the population, but the normalized fitness values also sum to 1. Thus, the normalized fitness is a probability value.

Fitness function with model complexity control

It has been noted that, after a certain number of generations, the average size of the mathematical models in a population starts growing at a rapid pace, while the increase in model complexity brings no significant improvement in fitness. This behavior displayed by GP is called bloat. Bloat often occurs in symbolic regression problems, where GP runs start from a population of small individuals that then grow in complexity in order to comply with all of the data. In practice, bloat affects the efficiency of GP significantly. Over-complicated model structures are computationally expensive to evolve or use, can be hard to interpret, and may generalize poorly, i.e., they exhibit the over-fitting problem. Over the years, many theories have been proposed to explain bloat from different aspects, but none of them is universally accepted as a unified theory that explains the various observations on bloat. Therefore, several strategies for controlling complexity have been proposed, with different theoretical foundations.

Minimum description length

As described earlier, over-fitting is a problem common to every application of GP. Besides the complexity of the model generated by GP, the presence of noise in the data is another possible cause of over-fitting. Good results with noisy data are only achievable at the cost of precision on the entire data distribution. Selecting a model of appropriate complexity to overcome the over-fitting problem is, therefore, always a key concern in any GP application. One of the proposed strategies is the Minimum Description Length (MDL), which provides a trade-off between the accuracy and the complexity of the model by including a structure estimation term for the model. The final model (with the minimal MDL) is optimum in the sense of being a consistent estimate of the complexity of the model while achieving the minimum error.
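The three basic fitness measures of Eqs. (2.9) to (2.11) can be sketched directly. The function names and the small data set of three hypothetical models are assumptions for illustration.

```python
def raw_fitness(predicted, observed):
    """Eq. (2.9): sum of squared errors over the data cases."""
    return sum((s - c) ** 2 for s, c in zip(predicted, observed))

def adjusted_fitness(raw):
    """Eq. (2.10): maps raw fitness into (0, 1], larger is better."""
    return 1.0 / (1.0 + raw)

def normalized_fitness(adjusted_values):
    """Eq. (2.11): adjusted fitness divided by its sum over the population."""
    total = sum(adjusted_values)
    return [a / total for a in adjusted_values]

# Observed data C(j) and the outputs S(i, j) of three hypothetical models.
observed = [1, 2, 3]
model_outputs = [
    [1, 2, 3],    # perfect fit
    [1, 2, 4],    # one case off by 1
    [0, 2, 5],    # two cases off
]

raw = [raw_fitness(out, observed) for out in model_outputs]
adjusted = [adjusted_fitness(r) for r in raw]
normalized = normalized_fitness(adjusted)
```

The normalized values can be used directly as selection probabilities, since they are non-negative and sum to 1 over the population.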

A number of criteria have been proposed for MDL; they compare models based on a measure of goodness of fit penalized by model complexity. The most popular and widely used criteria are Akaike's Information Criterion (AIC) (Eq. (2.12)) and the Schwarz Bayesian Information Criterion (BIC) (Eq. (2.13)).

Akaike's Information Criterion (AIC):

$$\frac{n}{2} \ln(\mathrm{MSE}) + k \qquad (2.12)$$

Schwarz Bayesian Information Criterion (BIC):

$$\frac{n}{2} \ln(\mathrm{MSE}) + \frac{k \ln(n)}{2} \qquad (2.13)$$

where MSE is the mean squared prediction error, MSE = SSE / n, and n is the number of data points used for the identification of the model, i.e., the sample size. Both AIC and BIC take the form of a penalized maximized likelihood. Their first term can be interpreted as an evaluation of the model's accuracy. The second term is the penalty term, which can be interpreted as the complexity of the model and is a function of the number of parameters and the depth of the tree structure that represents the model. The two criteria use different penalties: AIC adds 1 for each additional variable included in a model, while BIC adds ln(n)/2.

Parsimony pressure

A variety of practical techniques have been proposed to control complexity bloat. Among these techniques, the parsimony pressure method is a simple and frequently used method to control bloat in genetic programming [Zhang et al. 1993, Zhang et al. 1995]. In this method, the parsimony pressure is applied to the original fitness function:

$$f_p(x) = f(x) - c \, l(x) \qquad (2.14)$$

Here $f_p(x)$ is the new fitness function with the parsimony pressure term, $f(x)$ is the original fitness of model $x$, as mentioned above, and $c$ is a constant known as the parsimony coefficient. $l(x)$ is the size of model $x$, counted as the number of intermediate nodes in the tree representation, i.e., the number of mathematical operators that appear in the model.

This new fitness function $f_p(x)$ minimizes model size by using the penalty term as a mild constraint. The penalty is simply proportional to model size, so the fitness of a model decreases as its size increases. The strength of control over bloat is determined by the parsimony coefficient $c$. The value of this coefficient is very important: if $c$ is too small, GP runs will still bloat wildly; if it is too large, GP will take the minimization of model size as its main target and will almost ignore fitness, which leads to a loss of model accuracy and consequently weakens the predictive ability of the model. However, the proper value of the parsimony coefficient depends strongly on the specific problem being solved, the choice of functions and terminals, and various GP parameter settings. Very few theories have been proposed to help in setting the parsimony coefficient, and trial and error was widely used before Poli [2008] introduced a simple, effective and theoretically sound solution to this problem.

The strategies introduced above act through the fitness function to counter complexity bloat and, thereby, the over-fitting problem. Besides these anti-bloat selection rules, numerous empirical techniques based on improvements to the GP algorithm have also been proposed to control complexity bloat. Briefly, these techniques fall into two major categories: size and depth limits [Koza 1992], and anti-bloat genetic operators [Kinnear 1993, Langdon 1998, Langdon 2000, Crawford-Marks et al. 2002].

Fitness function using external validation

The preceding section introduced two of the well-accepted strategies for complexity control. Although based on different theories, they both combine multiple objectives into a scalar fitness function. A different strategy for choosing models is sometimes used, based not on adding a penalty term but on external validation of the model. The basic idea is to randomly split the data into two parts, a training set and a validation set. The training set is used to construct the models and estimate the parameters. Then the fitness function is recalculated using the validation set, and these validation scores are used to select models. Since the training set and the validation set are independently and randomly selected, the validation score of a given model provides an unbiased estimate of the fitness value of that model for new data points; therefore, the difference in validation scores can be used to choose between models. This general idea of validation has been extended to the notion of cross-validation. The external validation method will be discussed further in a later section.
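The external validation idea can be sketched as follows. Everything named here is an assumption for illustration: the synthetic data, the 30% validation fraction, and the two candidate models, one of which simply memorizes the training points as an extreme case of over-fitting.

```python
import random

def split_data(data, validation_fraction=0.3, seed=0):
    """Randomly split (x, y) pairs into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def sse(model, data):
    """Sum of squared errors of a model over a data set."""
    return sum((model(x) - y) ** 2 for x, y in data)

# Synthetic data: y = 2x + 1 with a small deterministic disturbance.
data = [(x, 2 * x + 1 + 0.1 * (-1) ** x) for x in range(10)]
training, validation = split_data(data)

# Candidate 1: a simple model with the right structure.
def linear(x):
    return 2 * x + 1

# Candidate 2: memorizes the training set exactly, returns 0 for unseen x.
table = dict(training)
def memorizer(x):
    return table.get(x, 0.0)

models = {"linear": linear, "memorizer": memorizer}
training_scores = {name: sse(m, training) for name, m in models.items()}
validation_scores = {name: sse(m, validation) for name, m in models.items()}
selected = min(validation_scores, key=validation_scores.get)
```

On the training set the memorizing model looks perfect, but its validation score exposes the over-fitting, so selection by validation score prefers the simpler model without any explicit complexity penalty.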

2.3.3 Genetic operators

In this section, three genetic operators are described in detail. First, reproduction and crossover are introduced as the two primary operations; then mutation, including its two different types, is introduced as a secondary operation.

Reproduction and crossover

The two primary genetic operations in GP for modifying the structures are fitness proportionate reproduction, as shown in Figure 2.9, and crossover, which is shown in a later figure.

The operation of fitness proportionate reproduction for genetic programming is an asexual operation in that it operates on only one parental individual (model). The result of this operation is one offspring individual (model). In this operation, if $f(i, t)$ is the fitness of an individual $i$ in the population of size $M$ at generation $t$, the individual will be copied into the next generation with probability

$$\frac{f(i, t)}{\sum_{j=1}^{M} f(j, t)} \qquad (2.15)$$

The operation of fitness proportionate reproduction does not create anything new in the population. It increases or decreases the number of occurrences of individuals already in the population. It improves the average fitness of the population (at the expense of the genetic diversity of the population) to the extent that it increases the number of occurrences of more fit individuals and decreases the number of occurrences of less fit individuals.

The crossover (recombination) operation for genetic programming starts with two parental individuals (models). Each parent is selected from the population with a probability equal to its normalized fitness. The result of the crossover operation is two offspring individuals (models). Unlike fitness proportionate reproduction, the crossover operation creates new individuals in the population.

[Figure 2.9 An Example of Reproduction Operator: the parent tree for a T + b is copied unchanged to the offspring tree for a T + b.]

The operation begins by randomly and independently selecting one point in each parent using a specified probability distribution (discussed below). The numbers of points in the two parental individuals are typically not equal. As will be seen, the crossover operation is well defined for any two individuals. That is, for any two individuals and any two crossover points, the resulting offspring are always valid individuals in the population. Each offspring contains some traits from its parents.
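A minimal sketch of subtree crossover on tree-structured models is given below. It assumes a nested-tuple encoding of the form (operator, left, right) and, as a simplification, selects crossover points uniformly at random, which is only one possible choice for the probability distribution mentioned above; all names are hypothetical.

```python
import random

def paths(tree, prefix=()):
    """Yield the position of every node in the tree as a tuple of child indices."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get_subtree(tree, path):
    """Return the subtree rooted at the given position."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of the tree with the subtree at the given position replaced."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(parent1, parent2, rng):
    """Swap one randomly chosen subtree between the parents; return two offspring."""
    point1 = rng.choice(list(paths(parent1)))
    point2 = rng.choice(list(paths(parent2)))
    sub1, sub2 = get_subtree(parent1, point1), get_subtree(parent2, point2)
    return (replace_subtree(parent1, point1, sub2),
            replace_subtree(parent2, point2, sub1))

def size(tree):
    """Total number of nodes (functions and terminals) in the tree."""
    return sum(1 for _ in paths(tree))

parent1 = ("+", ("*", "a", "T"), "b")        # a*T + b
parent2 = ("*", "c", ("^", "T", 2))          # c*T^2

# A fixed swap for illustration: terminal b exchanged with the subtree T^2.
fixed_child = replace_subtree(parent1, (2,), get_subtree(parent2, (2,)))

# A random swap; the parents themselves are left unchanged.
child1, child2 = crossover(parent1, parent2, random.Random(1))
```

Because every subtree evaluates to a number, exchanging any two subtrees yields two syntactically valid models, and the total number of nodes across the pair is conserved.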
