Inducing Probabilistic Grammars by Bayesian Model Merging


To appear in ICGI-94

Inducing Probabilistic Grammars by Bayesian Model Merging

Andreas Stolcke    Stephen Omohundro
International Computer Science Institute
1947 Center St., Suite 600
Berkeley, CA
E-mail: {stolcke,om}@icsi.berkeley.edu

Abstract

We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models ("Occam's Razor"). The general scheme is illustrated using three types of probabilistic grammars: Hidden Markov models, class-based n-grams, and stochastic context-free grammars.

1 Introduction

Probabilistic modeling has become increasingly important for applications such as speech recognition, information retrieval, machine translation, and biological sequence processing. The types of models used vary widely, ranging from simple n-grams to Hidden Markov Models (HMMs) and stochastic context-free grammars (SCFGs). A central problem for these applications is to find suitable models from a corpus of samples. Most common probabilistic models can be characterized by two parts: a discrete structure (e.g., the topology of an HMM, the context-free backbone of an SCFG), and a set of continuous parameters which determine the probabilities for the words, sentences, etc. described by the grammar. Given the discrete structure, the continuous parameters can usually be fit using standard methods, such as likelihood maximization. In the case of models with hidden variables (HMMs, SCFGs) estimation typically involves expectation maximization (EM) (Baum et al. 1970; Dempster et al. 1977; Baker 1979).

In this paper we address the more difficult first half of the problem: finding the discrete structure of a probabilistic model from training data. This task includes the problems of finding

the topology of an HMM, and finding the set of context-free productions for an SCFG. Our approach is called Bayesian model merging because it performs successive merging operations on the substructures of a model in an attempt to maximize the Bayesian posterior probability of the overall model structure, given the data. In this paper, we give an introduction to Bayesian model merging for probabilistic grammar inference, and demonstrate the approach on various model types. We also report briefly on some of the applications of the resulting learning algorithms, primarily in the area of natural language modeling.

2 Bayesian Model Merging

Model merging (Omohundro 1992) has been proposed as an efficient, robust, and cognitively plausible method for building probabilistic models in a variety of cognitive domains (e.g., vision). The method can be characterized as follows:

Data incorporation: Given a body of data X, build an initial model M_0 by explicitly accommodating each data point individually such that M_0 maximizes the likelihood P(X|M). The size of the initial model will thus grow with the amount of data, and will usually not exhibit significant generalization.

Structure merging: Build a sequence of new models, obtaining M_{i+1} from M_i by applying a generalization or merging operator m that coalesces substructures in M_i,

    M_{i+1} = m(M_i),   i = 0, 1, ...

The merging operation is dependent on the type of model at hand (as will be illustrated below), but it generally has the property that data points previously explained by separate model substructures come to be accounted for by a single, shared structure. The merging process thereby gradually moves from a simple, instance-based model toward one that expresses structural generalizations about the data.

To guide the search for suitable merging operations we need a criterion that trades off the goodness of fit of the data X against the desire for simpler, and therefore more general models. As a formalization of this tradeoff, we use the posterior probability P(M|X) of the model given the data.
According to Bayes' rule,

    P(M|X) = P(M) P(X|M) / P(X),

the posterior is proportional to the product of a prior probability term P(M) and a likelihood term P(X|M) (the denominator P(X) does not depend on M and can therefore be ignored for the purpose of maximization). The likelihood is defined by the model semantics, whereas the prior has to be chosen to express the bias, or prior expectation, as to what the likely models are. This choice is domain-dependent and will be elaborated below.

Finally, we need a search strategy to find models with high (maximal, if possible) posterior probability. A simple approach here is best-first search: Starting with the initial model (which maximizes the likelihood, but usually has a very low prior probability), explore all possible merging steps, and successively choose the one (greedy search) or ones (beam search) that give the greatest
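The greedy variant of this search can be sketched generically. The helpers `merge_candidates`, `apply_merge`, and `log_posterior` are hypothetical placeholders for the model-specific merge operators and posterior scoring discussed below, not the authors' implementation:

```python
def greedy_merge_search(model, merge_candidates, apply_merge, log_posterior):
    """Greedy best-first search: repeatedly apply the merge that most
    increases log P(M|X), and stop when no merge improves the posterior."""
    best_score = log_posterior(model)
    while True:
        candidates = merge_candidates(model)
        if not candidates:
            return model
        # Score every candidate merge of the current model.
        score, merge = max((log_posterior(apply_merge(model, m)), m)
                           for m in candidates)
        if score <= best_score:  # no improvement: stop merging
            return model
        model, best_score = apply_merge(model, merge), score
```

A beam-search variant keeps the n best models per step instead of only the single best one.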

immediate increase in posterior. Stop merging when no further increase is possible (after looking ahead a few steps to avoid simple local maxima). In practice, to keep the working models of manageable size, we can use an on-line version of the merging algorithm, in which the data incorporation and the merging/search stages are interleaved. We now make these concepts concrete for various types of probabilistic grammar.

3 Model merging applied to probabilistic grammars

3.1 Hidden Markov Models

Hidden Markov Models (HMMs) are a probabilistic form of non-deterministic finite-state models (Rabiner & Juang 1986). They allow a particularly straightforward version of the model merging approach.

Data incorporation. For each observed sample create a unique path between the initial and final states by assigning a new state to each symbol token in the sample. For example, given the data X = {ab, abab}, the initial model M_0 is shown at the top of Figure 1.

[Figure 1: Model merging for HMMs. The figure shows the initial model M_0 and the sequence of successively merged models; the state diagrams are not recoverable from this copy.]
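Data incorporation for HMMs is easy to sketch. The following is a minimal illustration (not the authors' code): states are integers, "I" and "F" are the distinguished initial and final states, and probabilities are represented by transition counts:

```python
def initial_hmm(samples):
    """Build M_0: one fresh state per symbol token, giving each sample its
    own unique path from the initial state I to the final state F.
    Returns (transitions, emissions) as count/label dictionaries."""
    I, F = "I", "F"
    transitions = {}   # (from_state, to_state) -> count
    emissions = {}     # state -> emitted symbol
    next_state = 0
    for sample in samples:
        prev = I
        for symbol in sample:
            state = next_state
            next_state += 1
            emissions[state] = symbol
            transitions[(prev, state)] = transitions.get((prev, state), 0) + 1
            prev = state
        transitions[(prev, F)] = transitions.get((prev, F), 0) + 1
    return transitions, emissions
```

For X = {ab, abab} this yields a two-state chain and a four-state chain, as at the top of Figure 1.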

Merging. In a single merging step, two old HMM states are replaced by a single new state, which inherits the union of the transitions and emissions from the old states. Figure 1 shows four successive merges (where each new state is given the smaller of the two indices of its predecessors). The second, third and fifth models in the example have a smaller model structure without changing the generated probability distribution, whereas the fourth model effectively generalizes from the finite sample set to the infinite set {(ab)^n : n > 0}. The crucial point is that each of these models can be found by locally maximizing the posterior probability of the HMM, under a wide range of priors (see below). Also, further merging in the last model structure shown produces a large penalty in the likelihood term, thereby decreasing the posterior. The algorithm thus stops at this point.

Prior distributions. Our approach has been to choose relatively uninformative priors, which spread the prior probability across all possible HMMs without giving explicit preference to particular topologies. A model M is defined by its structure (topology) M_S and its continuous parameter settings θ_M. The prior may therefore be decomposed as

    P(M) = P(M_S) P(θ_M | M_S).

Model structures receive prior probability according to their description length, i.e.,

    P(M_S) ∝ exp(-ℓ(M_S)),

where ℓ(M_S) is the number of bits required to encode M_S, e.g., by listing all transitions and emissions. The prior probabilities for θ_M, on the other hand, are assigned using a Dirichlet distribution for each of the transition and emission multinomial parameters, similar to the Bayesian decision tree induction method of Buntine (1992). (The parameter prior effectively spreads the posterior probability as if a certain number of evenly distributed "virtual" samples had been observed for each transition and emission.) For convenience we assume that the parameters associated with each state are a priori independent.
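The merging step itself can be sketched on a representation of transitions as count dictionaries keyed by state pairs (an illustrative sketch, not the authors' code; emission counts would be pooled in the same way):

```python
def merge_states(transitions, keep, drop):
    """Merge state `drop` into state `keep`: every transition into or out of
    `drop` is redirected to `keep`, and counts of coinciding arcs are summed,
    so the merged state inherits the union of the old transitions."""
    merged = {}
    for (src, dst), count in transitions.items():
        src = keep if src == drop else src
        dst = keep if dst == drop else dst
        merged[(src, dst)] = merged.get((src, dst), 0) + count
    return merged
```

Note that a transition between the two merged states becomes a self-loop on the surviving state, which is exactly how the generalization to {(ab)^n : n > 0} arises in Figure 1.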
There are three intuitive ways of understanding why simple priors like the ones used here lead to higher posterior probabilities for simpler HMMs, other things being equal:

- Smaller topologies have a smaller description length, and hence higher prior probability. This corresponds to the intuition that a larger structure needs to be picked from among a larger range of possible equally sized alternatives, thus making each individual choice less probable a priori.

- Larger models have more parameters, thus making each particular parameter setting less likely (this is the "Occam factor" (Gull 1988)).

- After two states have been merged, the effective amount of data per parameter increases (the evidence for the merged substructures is pooled). This shifts and peaks the posterior distributions for those parameters closer to their maximum likelihood settings.

These principles also apply mutatis mutandis to the other applications of model merging inference.

Posterior computation. Recall that the target of the inference procedure is the model structure, hence the goal is to maximize the posterior

    P(M_S|X) ∝ P(M_S) P(X|M_S).

The mathematical reason why one wants to maximize P(M_S|X), rather than simply P(M|X), is that for inference purposes a model with a high posterior structure represents a better approximation to the Bayes-optimal procedure of averaging over all possible models M, including both structures and parameter settings (see Stolcke & Omohundro (1994:17f.) for details). The evaluation of the second term above involves an integral over the parameter prior,

    P(X|M_S) = ∫ P(θ_M|M_S) P(X|M_S, θ_M) dθ_M,

which can be approximated using the common Viterbi assumption about sample probabilities in HMMs (which in our case tends to be correct due to the way the HMM structures are initially constructed).

Applications and results. We compared the model merging strategy applied to HMMs against the standard Baum-Welch procedure when applied to a fully parameterized, randomly initialized HMM structure. The latter represents one potential approach to the structure finding problem, effectively turning it into a parameter estimation problem, but it faces the problem of local maxima in the parameter space. Also, in a Baum-Welch approach the number of states in the HMM has to be known, guessed or estimated in advance, whereas the merging approach chooses that number adaptively from the data. Both approaches need to be evaluated empirically.

First, we tested the two methods on a few simple regular languages that we turned into HMMs by assigning uniform probabilities to their corresponding finite-state automata. Training proceeded using either random or structure-covering sets of samples. The merging approach reliably inferred these admittedly simple HMM structures. However, the Baum-Welch estimator turned out to be extremely sensitive to the initial parameter settings and failed on more than half of the trials to find a reasonable structure, both with minimal and redundant numbers of states.[1]
Second, we tested merging and Baum-Welch (and a number of other methods) on a set of naturally occurring data that might be modeled by HMMs. The task was to derive phonetic pronunciation models from available transcriptions in the TIMIT speech database. In this case, the Baum-Welch-derived model structures turned out to be close in generalization performance to the slightly better merged models (as measured by cross-entropy on a test set).[2] However, to achieve this performance the Baum-Welch HMMs made use of about twice as many transitions as the more compact merged HMMs, which would have a serious impact on potential applications of such models in speech recognition.

Finally, the HMM merging algorithm was integrated into the training of a medium-scale spoken language understanding system (Wooters & Stolcke 1994). Here, the algorithm also serves the purpose of inducing multi-pronunciation word models from speech data, but it is now coupled with a separate process that estimates the acoustic emission likelihoods for the HMM states. The goal of this setup was to improve the system's performance over a comparable

[1] Case studies of the structures, under-generalizations and overgeneralizations found in this experiment can be found in Stolcke & Omohundro (1994).
[2] It may be argued that this domain is slightly simpler, since it contains, for example, no looping HMM structures.

system that used only the standard single-pronunciation HMMs for each word, while remaining practical in terms of training cost and recognition speed. By using the more complex, merged HMMs the word error was indeed reduced significantly (from 40.6% to 32.1%), indicating that the pronunciation models produced by the merging process were at least adequate for this kind of task.

3.2 Class-based n-gram Models

Brown et al. (1992) describe a method for building class-based n-gram models from data. Such models express the transition probabilities between words not directly in terms of individual word types, but rather between word categories, or classes. Each class, in turn, has fixed emission probabilities for the individual words. One potential advantage of this approach is that it can drastically reduce the number of parameters associated with ordinary n-gram models, by effectively sharing parameters between similarly distributed words.

To infer word classes automatically, Brown et al. (1992) suggest an algorithm that successively merges classes according to a maximum-likelihood criterion, until a target number of classes is reached. From our perspective we can cast their algorithm as an instance of model merging, the essential difference being the non-Bayesian (likelihood-based) criterion guiding the merging and stopping. In fact, in retrospect, class merging in n-gram grammars can be understood as a special case of HMM merging: a class-based n-gram model can be straightforwardly expressed as a special form of HMM in which each class corresponds to a state, and transition probabilities correspond to class n-gram probabilities.

3.3 Stochastic Context-Free Grammars

Based on the model merging approach to HMM induction, we have extended the algorithm to apply to stochastic context-free grammars (SCFGs), the probabilistic generalization of CFGs (Booth & Thompson 1973; Jelinek et al. 1992). A more detailed description of SCFG model merging can be found in Stolcke (1994).

Data incorporation. To incorporate a new sample string into an SCFG we can simply add a top-level production (for the start nonterminal S) that covers the sample precisely.
For example, the grammar at the top of Figure 2 arises from the samples {ab, aabb, aaabbb}. Instead of letting terminal symbols appear in production right-hand sides, we also create one nonterminal for each observed terminal, which simplifies the merging operators.

Merging. The obvious analog of merging HMM states is the merging of nonterminals in an SCFG. This is indeed one of the strategies used to generalize a given SCFG, and it can potentially produce inductive leaps by generating a grammar that generates more than its predecessor, while reducing the size of the grammar. However, the hallmark of context-free grammars are the hierarchical, center-embedding structures they can represent. We therefore introduce a second operator called chunking. It takes a given sequence of nonterminals and abbreviates it using a newly created nonterminal, as illustrated by the sequence AB in the second grammar of Figure 2. In that example, one more chunking step, followed by two merging steps, produces a grammar for the language

    S → AB | AABB | AAABBB
    A → a
    B → b

  Chunk(AB) → X:

    S → X | AXB | AAXBB
    X → AB

  Chunk(AXB) → Y:

    S → X | Y | AYB
    X → AB
    Y → AXB

  Merge S, Y:

    S → X | ASB
    X → AB

  Merge S, X:

    S → AB | ASB

Figure 2: Model merging for SCFGs.

{a^n b^n : n > 0}. (The probabilities in the grammar are implicit in the usage counts for each production, and are not shown in the figure.)

Priors. As before, we split the prior for a grammar M into a contribution for the structural aspects M_S, and one for the continuous parameter settings θ_M. The goal is to maximize the posterior of the structure given the data, P(M_S|X). For P(M_S) we again use a description length-induced distribution, obtained by a simple enumerative encoding of the grammar productions (each occurrence of a nonterminal contributes log N bits to the description length, where N is the number of nonterminals). For P(θ_M|M_S) we observe that the production probabilities associated with a given left-hand side form a multinomial, and so we use symmetrical Dirichlet priors for these parameters.
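The chunking operator of Figure 2 can be sketched on a grammar represented as a list of (lhs, rhs) production pairs (a simplified sketch; production counts and probabilities are omitted):

```python
def chunk(grammar, sequence, new_nt):
    """Abbreviate every occurrence of `sequence` in a right-hand side by the
    new nonterminal, and add the defining production new_nt -> sequence."""
    seq, n = list(sequence), len(sequence)
    rewritten = []
    for lhs, rhs in grammar:
        out, i = [], 0
        while i < len(rhs):
            if rhs[i:i + n] == seq:   # replace one occurrence of the chunk
                out.append(new_nt)
                i += n
            else:
                out.append(rhs[i])
                i += 1
        rewritten.append((lhs, out))
    return rewritten + [(new_nt, seq)]
```

Applied to the initial grammar of Figure 2 with the sequence AB and new nonterminal X, this produces S → X | AXB | AAXBB together with X → AB, matching the second grammar in the figure.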

Language             Samples   Grammar                               Search
Parentheses          8         S → () | (S) | SS                     BF
a^{2n}               5         S → aa | SS                           BF
(ab)^n               5         S → ab | SS                           BF
a^n b^n              5         S → ab | aSb                          BF
wcw^R, w ∈ {a,b}     7         S → c | aSa | bSb                     BS(3)
Addition strings     23        S → a | b | (S) | S+S                 BS(4)
Shape grammar        11        S → adY | aYS;  Y → b | cY            BS(4)
Basic English        25        S → I am A | he T | she T | it T      BS(3)
                                 | they V | you V | we V
                                 | this C | that C
                               T → is A;  V → are A
                               Z → man | woman;  A → there | here
                               C → is a Z | a ZT

Table 1: Test grammars from Cook et al. (1976). Search methods are indicated by BF (best-first) or BS(n) (beam search with width n).

Search. In the case of HMMs, a greedy merging strategy (always pursuing only the locally most promising choice) seems to give generally good results. Unfortunately, this is no longer true in the extended SCFG merging algorithm. The chief reason is that chunking steps typically require several following merging steps and/or additional chunking steps to improve a grammar's posterior score. To account for this complication, we use a more elaborate beam search that considers a number of relatively good grammars in parallel, and stops only after a certain neighborhood of alternative models has been searched without producing further improvements. The experiments reported below use small beam widths (between 3 and 10).

Formal language experiments. We start by examining the performance of the algorithm on example grammars found in the literature on other CFG induction methods. Cook et al. (1976) use a collection of techniques related to ours for inferring probabilistic CFGs from sample distributions, rather than absolute sample counts (see discussion in the next section). These languages and the inferred grammars are summarized in Table 1. They include classic textbook examples of CFGs (the parenthesis language, arithmetic expressions) as well as simple grammars meant to model empirical data. We replicated Cook's results by applying the algorithm to the same small sets of high probability strings as used in Cook et al. (1976). (The number of distinct sample strings is given in the second column of Table 1.)
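The beam search just described (the BS(n) entries in Table 1) can be sketched as follows. `neighbors` and `log_posterior` are hypothetical placeholders for the chunk/merge successor function and the posterior score, and the stopping rule is simplified to a single non-improving round, whereas the paper's version searches a whole neighborhood before giving up:

```python
def beam_merge_search(model, neighbors, log_posterior, width):
    """Keep the `width` best models per round; stop when a round of
    expansions yields no model better than the current best."""
    beam = [model]
    best = model
    while True:
        expanded = [n for m in beam for n in neighbors(m)]
        if not expanded:
            return best
        # Retain only the `width` highest-scoring successors.
        beam = sorted(expanded, key=log_posterior, reverse=True)[:width]
        if log_posterior(beam[0]) <= log_posterior(best):
            return best
        best = beam[0]
```

This keeps several "relatively good" grammars alive in parallel, which is what allows a temporarily unprofitable chunking step to pay off after subsequent merges.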
Since the Bayesian framework makes use of the actual observed sample counts, we scaled these to sum to 50 for each training corpus. The Bayesian merging procedure produced the target grammars in all cases, using different

levels of sophistication in the search strategy (as indicated by column 4 in Table 1). Since Cook's algorithm uses a very different, non-Bayesian formalization of the data fit vs. grammar complexity trade-off, we can conclude that the example grammars must be quite robust to a variety of reasonable implementations of this trade-off. A more difficult language that Cook et al. (1976) list as beyond the scope of their algorithm can also be inferred, using beam search: the palindromes ww^R, w ∈ {a,b}. We attribute this improvement to the more flexible search techniques used.

Natural language syntax. An obvious question arising for SCFG induction algorithms is whether they are sufficient for deriving adequate models from realistic corpora of naturally occurring samples, i.e., to automatically build models for natural language processing applications. Preliminary experiments on such corpora have yielded mixed results, which lead us to conclude that additional methods will be necessary for success in this area. A fundamental problem is that available data will typically be sparse relative to the complexity of the target grammars, i.e., not all constructions will be represented with sufficient coverage to allow the induction of correct generalizations. We are currently investigating techniques to incorporate additional, independent sources of generalization. For example, a part-of-speech tagging phase prior to SCFG induction proper could reduce the work of the merging algorithm considerably.

Given these difficulties with large-scale natural language applications, we have resorted to smaller experiments that try to determine whether certain fundamental structures found in NL grammars can in principle be identified by the Bayesian framework proposed here. In Stolcke (1994) a number of phenomena are examined, including:

Lexical categorization. Nonterminal merging assigns terminal symbols to common nonterminals whenever there is substantial overlap in the contexts in which they occur.
Phrase structure abstraction. Standard phrasal categories such as noun phrases, prepositional and verb phrases are created by chunking because they allow a more compact description of the grammar by abbreviating common collocations, and/or because they allow more succinct generalizations (in combination with merging) to be stated.

Agreement. Co-variation in the forms of co-occurring syntactic or lexical elements (e.g., number agreement between subjects and verbs in English) is induced by merging of nonterminals. However, even in this learning framework it becomes clear that CFGs (as opposed to, say, feature-based grammar formalisms) are an inadequate representation for these phenomena. The usual blow-up in grammar size to represent agreement in CFG form can also cause the wrong phrase structure bracketing to be preferred by the simplicity bias.

Recursive phrase structure. Recursive and iterative productions for phenomena such as embedded relative clauses can be induced using the chunking and merging operators. We conclude with a small grammar exhibiting recursive relative clause embedding, from Langley (1994). The target grammar has the form

    S    --> NP VP
    VP   --> Verb NP
    NP   --> Art Noun
         --> Art Noun RC

    RC   --> Rel VP
    Verb --> saw | heard
    Noun --> cat | dog | mouse
    Art  --> the
    Rel  --> that

with uniform probabilities on all productions. Chunking and merging of 100 random samples produces a grammar that is weakly equivalent to the above grammar. It also produced essentially identical phrase structure, except for a more compact implementation of the recursion through RC:

    S   --> NP VP
    VP  --> V NP
    NP  --> DET N
        --> NP RC
    RC  --> REL VP
    DET --> a | the
    N   --> cat | dog | mouse
    REL --> that
    V   --> heard | saw

4 Related work

Many of the ingredients of the model merging approach have been used separately in a variety of settings. Successive merging of states (or state equivalence class construction) is a technique widely used in algorithms for finite-state automata (Hopcroft & Ullman 1979) and automata learning (Angluin & Smith 1983); a recent application to probabilistic finite-state automata is Carrasco & Oncina (1994).

Bell et al. (1990) and Ron et al. (1994) describe a method for learning deterministic finite-state models that is in a sense the opposite of the merging approach: successive state splitting. In this framework, each state represents a unique suffix of the input, and states are repeatedly refined by extending the suffixes represented, as long as this move improves the model likelihood by a certain minimum amount. The class of models thus learnable is restricted, since each state can make predictions based only on inputs within a bounded distance from the current input, but the approach has other advantages, e.g., the final number of states is typically smaller than for a merging algorithm, since the tendency is to overgeneralize, rather than undergeneralize. We are currently investigating state splitting as a complementary search operator in our merging algorithms.

Horning (1969) first proposed using a Bayesian formulation to capture the trade-off between grammar complexity and data fit. His algorithm, however, is based on searching for the grammar with the highest posterior probability by enumerating all possible grammars (such that one can

tell after a finite number of steps when the optimal grammar has been found). Unfortunately, the enumeration approach proved to be infeasible for practical purposes.

The chunking operation used in SCFG induction is part of a number of algorithms aimed at CFG induction, including Cook et al. (1976), Wolff (1987), and Langley (1994), where it is typically paired with other operations that have effects similar to merging. However, only the algorithm of Cook et al. (1976) has probabilistic CFGs as the target of induction, and therefore merits closer comparison to our approach.

A major conceptual difference of Cook's approach is that it is based on an information-theoretic quality measure that depends only on the relative frequencies of observed samples. The Bayesian approach, on the other hand, explicitly takes into account the absolute frequencies of the data. Thus, the amount of data available, not only its distribution, has an effect on the outcome. For example, having observed the samples a, aa, aaa, aaaa, a model of {a^n : n > 0} is quite likely. On the other hand, if the same samples were observed a hundred times each, with no other additional data, such a conclusion should be intuitively unlikely, although the sample strings themselves and their relative frequencies are unchanged. The Bayesian analysis confirms this intuition: a 100-fold sample frequency entails a 100-fold magnification of the log-likelihood loss incurred for any generalization, which would block the inductive leap to a model of {a^n : n > 0}. Incidentally, one can use sample frequency as a principled device to control the degree of generalization in a Bayesian induction algorithm explicitly (Quinlan & Rivest 1989; Stolcke & Omohundro 1994).

5 Future directions

Since all algorithms presented here are of the generate-and-evaluate kind, they are trivial to integrate with external sources of constraints or information about possible candidate models. External structural constraints can be used to effectively set the prior (and therefore posterior) probability for certain models to zero.
We hope to explore more informed priors and constraints to tackle larger problems, especially in the SCFG domain.

In retrospect, the merging operations used in our probabilistic grammar induction algorithms share a strong conceptual and formal similarity to those used by various induction methods for non-probabilistic grammars (Angluin & Smith 1983; Sakakibara 1990). Those algorithms are typically based on constructing equivalence classes of states based on some criterion of distinguishability. Intuitively, the (difference in) posterior probability used to guide the Bayesian merging process represents a fuzzy, probabilistic version of such an equivalence criterion. This suggests looking for other non-probabilistic induction methods of this kind and adapting them to the Bayesian approach. A promising candidate we are currently investigating is the transducer inference algorithm of Oncina et al. (1993).

6 Conclusions

We have presented a Bayesian model merging framework for inducing probabilistic grammars from samples, by stepwise generalization from a sample-based ad-hoc model through successive merging operators. The framework is quite general and can therefore be instantiated for a variety of standard or novel classes of probabilistic models, as demonstrated here for HMMs and SCFGs.

The HMM merging variant, which is empirically more reliable for structure induction than Baum-Welch estimation, is being used successfully in speech modeling applications. The SCFG version of the merging algorithm generalizes and simplifies a number of related algorithms that have been proposed previously, thus showing how the Bayesian posterior probability criterion can combine data fit and model simplicity in a uniform and principled way. The more complex model search space encountered with SCFGs also highlights the need for relatively sophisticated search strategies.

References

ANGLUIN, D., & C. H. SMITH. 1983. Inductive inference: Theory and methods. ACM Computing Surveys.

BAKER, JAMES K. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, ed. by Jared J. Wolf & Dennis H. Klatt. MIT, Cambridge, Mass.

BAUM, LEONARD E., TED PETRIE, GEORGE SOULES, & NORMAN WEISS. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics.

BELL, TIMOTHY C., JOHN G. CLEARY, & IAN H. WITTEN. 1990. Text Compression. Englewood Cliffs, N.J.: Prentice Hall.

BOOTH, TAYLOR L., & RICHARD A. THOMPSON. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers.

BROWN, PETER F., VINCENT J. DELLA PIETRA, PETER V. DESOUZA, JENIFER C. LAI, & ROBERT L. MERCER. 1992. Class-based n-gram models of natural language. Computational Linguistics.

BUNTINE, WRAY. 1992. Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and Statistics III, ed. by D. J. Hand. Chapman & Hall.

CARRASCO, RAFAEL C., & JOSÉ ONCINA. 1994. Learning stochastic regular grammars by means of a state merging method. This volume.

COOK, CRAIG M., AZRIEL ROSENFELD, & ALAN R. ARONSON. 1976. Grammatical inference by hill climbing. Information Sciences.

DEMPSTER, A. P., N. M. LAIRD, & D. B. RUBIN. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B.

GULL, S. F. 1988. Bayesian inductive inference and maximum entropy.
In Maximum Entropy and Bayesian Methods in Science and Engineering, Volume 1: Foundations, ed. by G. J. Erickson & C. R. Smith. Dordrecht: Kluwer.

HOPCROFT, JOHN E., & JEFFREY D. ULLMAN. 1979. Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley.

HORNING, JAMES JAY. 1969. A study of grammatical inference. Technical Report CS 139, Computer Science Department, Stanford University, Stanford, CA.

JELINEK, FREDERICK, JOHN D. LAFFERTY, & ROBERT L. MERCER. 1992. Basic methods of probabilistic context free grammars. In Speech Recognition and Understanding. Recent Advances, Trends, and Applications, ed. by Pietro Laface & Renato De Mori, volume F75 of NATO Advanced Sciences Institutes Series. Berlin: Springer Verlag. Proceedings of the NATO Advanced Study Institute, Cetraro, Italy, July.

LANGLEY, PAT. 1994. Simplicity and representation change in grammar induction. Unpublished mss.

OMOHUNDRO, STEPHEN M. 1992. Best-first model merging for dynamic learning and recognition. Technical Report, International Computer Science Institute, Berkeley, CA.

ONCINA, JOSÉ, PEDRO GARCÍA, & ENRIQUE VIDAL. 1993. Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

QUINLAN, J. ROSS, & RONALD L. RIVEST. 1989. Inferring decision trees using the minimum description length principle. Information and Computation.

RABINER, L. R., & B. H. JUANG. 1986. An introduction to hidden Markov models. IEEE ASSP Magazine.

RON, DANA, YORAM SINGER, & NAFTALI TISHBY. 1994. The power of amnesia. In Advances in Neural Information Processing Systems 6, ed. by Jack Cowan, Gerald Tesauro, & Joshua Alspector. San Mateo, CA: Morgan Kaufmann.

SAKAKIBARA, YASUBUMI. 1990. Learning context-free grammars from structural data in polynomial time. Theoretical Computer Science.

STOLCKE, ANDREAS. 1994. Bayesian Learning of Probabilistic Language Models. Berkeley, CA: University of California dissertation.

STOLCKE, ANDREAS, & STEPHEN OMOHUNDRO. 1994. Best-first model merging for hidden Markov model induction. Technical Report, International Computer Science Institute, Berkeley, CA.

WOLFF, J. G. 1987. Cognitive development as optimisation. In Computational models of learning, ed. by L. Bolc. Berlin: Springer Verlag.

WOOTERS, CHUCK, & ANDREAS STOLCKE. 1994. Multiple-pronunciation lexical modeling in a speaker-independent speech understanding system.
In Proceedings International Conference on Spoken Language Processing, Yokohama.


More information

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5 CS321 Lnguges nd Compiler Design I Winter 2012 Lecture 5 1 FINITE AUTOMATA A non-deterministic finite utomton (NFA) consists of: An input lphet Σ, e.g. Σ =,. A set of sttes S, e.g. S = {1, 3, 5, 7, 11,

More information

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs. Lecture 5 Wlks, Trils, Pths nd Connectedness Reding: Some of the mteril in this lecture comes from Section 1.2 of Dieter Jungnickel (2008), Grphs, Networks nd Algorithms, 3rd edition, which is ville online

More information

Fig.25: the Role of LEX

Fig.25: the Role of LEX The Lnguge for Specifying Lexicl Anlyzer We shll now study how to uild lexicl nlyzer from specifiction of tokens in the form of list of regulr expressions The discussion centers round the design of n existing

More information

From Dependencies to Evaluation Strategies

From Dependencies to Evaluation Strategies From Dependencies to Evlution Strtegies Possile strtegies: 1 let the user define the evlution order 2 utomtic strtegy sed on the dependencies: use locl dependencies to determine which ttriutes to compute

More information

Algorithm Design (5) Text Search

Algorithm Design (5) Text Search Algorithm Design (5) Text Serch Tkshi Chikym School of Engineering The University of Tokyo Text Serch Find sustring tht mtches the given key string in text dt of lrge mount Key string: chr x[m] Text Dt:

More information

What are suffix trees?

What are suffix trees? Suffix Trees 1 Wht re suffix trees? Allow lgorithm designers to store very lrge mount of informtion out strings while still keeping within liner spce Allow users to serch for new strings in the originl

More information

CS 432 Fall Mike Lam, Professor a (bc)* Regular Expressions and Finite Automata

CS 432 Fall Mike Lam, Professor a (bc)* Regular Expressions and Finite Automata CS 432 Fll 2017 Mike Lm, Professor (c)* Regulr Expressions nd Finite Automt Compiltion Current focus "Bck end" Source code Tokens Syntx tree Mchine code chr dt[20]; int min() { flot x = 42.0; return 7;

More information

Today. CS 188: Artificial Intelligence Fall Recap: Search. Example: Pancake Problem. Example: Pancake Problem. General Tree Search.

Today. CS 188: Artificial Intelligence Fall Recap: Search. Example: Pancake Problem. Example: Pancake Problem. General Tree Search. CS 88: Artificil Intelligence Fll 00 Lecture : A* Serch 9//00 A* Serch rph Serch Tody Heuristic Design Dn Klein UC Berkeley Multiple slides from Sturt Russell or Andrew Moore Recp: Serch Exmple: Pncke

More information

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism Efficient K-NN Serch in Polyphonic Music Dtses Using Lower Bounding Mechnism Ning-Hn Liu Deprtment of Computer Science Ntionl Tsing Hu University Hsinchu,Tiwn 300, R.O.C 886-3-575679 nhliou@yhoo.com.tw

More information

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method A New Lerning Algorithm for the MAXQ Hierrchicl Reinforcement Lerning Method Frzneh Mirzzdeh 1, Bbk Behsz 2, nd Hmid Beigy 1 1 Deprtment of Computer Engineering, Shrif University of Technology, Tehrn,

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distriuted Systems Principles nd Prdigms Chpter 11 (version April 7, 2008) Mrten vn Steen Vrije Universiteit Amsterdm, Fculty of Science Dept. Mthemtics nd Computer Science Room R4.20. Tel: (020) 598 7784

More information

CS 430 Spring Mike Lam, Professor. Parsing

CS 430 Spring Mike Lam, Professor. Parsing CS 430 Spring 2015 Mike Lm, Professor Prsing Syntx Anlysis We cn now formlly descrie lnguge's syntx Using regulr expressions nd BNF grmmrs How does tht help us? Syntx Anlysis We cn now formlly descrie

More information

2014 Haskell January Test Regular Expressions and Finite Automata

2014 Haskell January Test Regular Expressions and Finite Automata 0 Hskell Jnury Test Regulr Expressions nd Finite Automt This test comprises four prts nd the mximum mrk is 5. Prts I, II nd III re worth 3 of the 5 mrks vilble. The 0 Hskell Progrmming Prize will be wrded

More information

AI Adjacent Fields. This slide deck courtesy of Dan Klein at UC Berkeley

AI Adjacent Fields. This slide deck courtesy of Dan Klein at UC Berkeley AI Adjcent Fields Philosophy: Logic, methods of resoning Mind s physicl system Foundtions of lerning, lnguge, rtionlity Mthemtics Forml representtion nd proof Algorithms, computtion, (un)decidility, (in)trctility

More information

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have

P(r)dr = probability of generating a random number in the interval dr near r. For this probability idea to make sense we must have Rndom Numers nd Monte Crlo Methods Rndom Numer Methods The integrtion methods discussed so fr ll re sed upon mking polynomil pproximtions to the integrnd. Another clss of numericl methods relies upon using

More information

The Greedy Method. The Greedy Method

The Greedy Method. The Greedy Method Lists nd Itertors /8/26 Presenttion for use with the textook, Algorithm Design nd Applictions, y M. T. Goodrich nd R. Tmssi, Wiley, 25 The Greedy Method The Greedy Method The greedy method is generl lgorithm

More information

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. General Tree Search. Uniform Cost. Lecture 3: A* Search 9/4/2007

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. General Tree Search. Uniform Cost. Lecture 3: A* Search 9/4/2007 CS 88: Artificil Intelligence Fll 2007 Lecture : A* Serch 9/4/2007 Dn Klein UC Berkeley Mny slides over the course dpted from either Sturt Russell or Andrew Moore Announcements Sections: New section 06:

More information

this grammar generates the following language: Because this symbol will also be used in a later step, it receives the

this grammar generates the following language: Because this symbol will also be used in a later step, it receives the LR() nlysis Drwcks of LR(). Look-hed symols s eplined efore, concerning LR(), it is possile to consult the net set to determine, in the reduction sttes, for which symols it would e possile to perform reductions.

More information

A Comparison of the Discretization Approach for CST and Discretization Approach for VDM

A Comparison of the Discretization Approach for CST and Discretization Approach for VDM Interntionl Journl of Innovtive Reserch in Advnced Engineering (IJIRAE) Volume1 Issue1 (Mrch 2014) A Comprison of the Discretiztion Approch for CST nd Discretiztion Approch for VDM Omr A. A. Shib Fculty

More information

OUTPUT DELIVERY SYSTEM

OUTPUT DELIVERY SYSTEM Differences in ODS formtting for HTML with Proc Print nd Proc Report Lur L. M. Thornton, USDA-ARS, Animl Improvement Progrms Lortory, Beltsville, MD ABSTRACT While Proc Print is terrific tool for dt checking

More information

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence Winter 2016

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence Winter 2016 Solving Prolems y Serching CS 486/686: Introduction to Artificil Intelligence Winter 2016 1 Introduction Serch ws one of the first topics studied in AI - Newell nd Simon (1961) Generl Prolem Solver Centrl

More information

Context-Free Grammars

Context-Free Grammars Context-Free Grmmrs Descriing Lnguges We've seen two models for the regulr lnguges: Finite utomt ccept precisely the strings in the lnguge. Regulr expressions descrie precisely the strings in the lnguge.

More information

Lexical Analysis: Constructing a Scanner from Regular Expressions

Lexical Analysis: Constructing a Scanner from Regular Expressions Lexicl Anlysis: Constructing Scnner from Regulr Expressions Gol Show how to construct FA to recognize ny RE This Lecture Convert RE to n nondeterministic finite utomton (NFA) Use Thompson s construction

More information

A dual of the rectangle-segmentation problem for binary matrices

A dual of the rectangle-segmentation problem for binary matrices A dul of the rectngle-segmenttion prolem for inry mtrices Thoms Klinowski Astrct We consider the prolem to decompose inry mtrix into smll numer of inry mtrices whose -entries form rectngle. We show tht

More information

Graphs with at most two trees in a forest building process

Graphs with at most two trees in a forest building process Grphs with t most two trees in forest uilding process rxiv:802.0533v [mth.co] 4 Fe 208 Steve Butler Mis Hmnk Mrie Hrdt Astrct Given grph, we cn form spnning forest y first sorting the edges in some order,

More information

Inference of node replacement graph grammars

Inference of node replacement graph grammars Glley Proof 22/6/27; :6 File: id293.tex; BOKCTP/Hin p. Intelligent Dt Anlysis (27) 24 IOS Press Inference of node replcement grph grmmrs Jcek P. Kukluk, Lwrence B. Holder nd Dine J. Cook Deprtment of Computer

More information

A Heuristic Approach for Discovering Reference Models by Mining Process Model Variants

A Heuristic Approach for Discovering Reference Models by Mining Process Model Variants A Heuristic Approch for Discovering Reference Models by Mining Process Model Vrints Chen Li 1, Mnfred Reichert 2, nd Andres Wombcher 3 1 Informtion System Group, University of Twente, The Netherlnds lic@cs.utwente.nl

More information

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining

EECS150 - Digital Design Lecture 23 - High-level Design and Optimization 3, Parallelism and Pipelining EECS150 - Digitl Design Lecture 23 - High-level Design nd Optimiztion 3, Prllelism nd Pipelining Nov 12, 2002 John Wwrzynek Fll 2002 EECS150 - Lec23-HL3 Pge 1 Prllelism Prllelism is the ct of doing more

More information

Context-Free Grammars

Context-Free Grammars Context-Free Grmmrs Descriing Lnguges We've seen two models for the regulr lnguges: Finite utomt ccept precisely the strings in the lnguge. Regulr expressions descrie precisely the strings in the lnguge.

More information

Statistical classification of spatial relationships among mathematical symbols

Statistical classification of spatial relationships among mathematical symbols 2009 10th Interntionl Conference on Document Anlysis nd Recognition Sttisticl clssifiction of sptil reltionships mong mthemticl symbols Wl Aly, Seiichi Uchid Deprtment of Intelligent Systems, Kyushu University

More information

LR Parsing, Part 2. Constructing Parse Tables. Need to Automatically Construct LR Parse Tables: Action and GOTO Table

LR Parsing, Part 2. Constructing Parse Tables. Need to Automatically Construct LR Parse Tables: Action and GOTO Table TDDD55 Compilers nd Interpreters TDDB44 Compiler Construction LR Prsing, Prt 2 Constructing Prse Tles Prse tle construction Grmmr conflict hndling Ctegories of LR Grmmrs nd Prsers Peter Fritzson, Christoph

More information

Mobile IP route optimization method for a carrier-scale IP network

Mobile IP route optimization method for a carrier-scale IP network Moile IP route optimiztion method for crrier-scle IP network Tkeshi Ihr, Hiroyuki Ohnishi, nd Ysushi Tkgi NTT Network Service Systems Lortories 3-9-11 Midori-cho, Musshino-shi, Tokyo 180-8585, Jpn Phone:

More information

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties, Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

CSEP 573 Artificial Intelligence Winter 2016

CSEP 573 Artificial Intelligence Winter 2016 CSEP 573 Artificil Intelligence Winter 2016 Luke Zettlemoyer Problem Spces nd Serch slides from Dn Klein, Sturt Russell, Andrew Moore, Dn Weld, Pieter Abbeel, Ali Frhdi Outline Agents tht Pln Ahed Serch

More information

Ma/CS 6b Class 1: Graph Recap

Ma/CS 6b Class 1: Graph Recap M/CS 6 Clss 1: Grph Recp By Adm Sheffer Course Detils Adm Sheffer. Office hour: Tuesdys 4pm. dmsh@cltech.edu TA: Victor Kstkin. Office hour: Tuesdys 7pm. 1:00 Mondy, Wednesdy, nd Fridy. http://www.mth.cltech.edu/~2014-15/2term/m006/

More information

CSCE 531, Spring 2017, Midterm Exam Answer Key

CSCE 531, Spring 2017, Midterm Exam Answer Key CCE 531, pring 2017, Midterm Exm Answer Key 1. (15 points) Using the method descried in the ook or in clss, convert the following regulr expression into n equivlent (nondeterministic) finite utomton: (

More information

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards A Tutology Checker loosely relted to Stålmrck s Algorithm y Mrtin Richrds mr@cl.cm.c.uk http://www.cl.cm.c.uk/users/mr/ University Computer Lortory New Museum Site Pemroke Street Cmridge, CB2 3QG Mrtin

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

1. SEQUENCES INVOLVING EXPONENTIAL GROWTH (GEOMETRIC SEQUENCES)

1. SEQUENCES INVOLVING EXPONENTIAL GROWTH (GEOMETRIC SEQUENCES) Numbers nd Opertions, Algebr, nd Functions 45. SEQUENCES INVOLVING EXPONENTIAL GROWTH (GEOMETRIC SEQUENCES) In sequence of terms involving eponentil growth, which the testing service lso clls geometric

More information

L. Yaroslavsky. Fundamentals of Digital Image Processing. Course

L. Yaroslavsky. Fundamentals of Digital Image Processing. Course L. Yroslvsky. Fundmentls of Digitl Imge Processing. Course 0555.330 Lecture. Imge enhncement.. Imge enhncement s n imge processing tsk. Clssifiction of imge enhncement methods Imge enhncement is processing

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

Efficient Rerouting Algorithms for Congestion Mitigation

Efficient Rerouting Algorithms for Congestion Mitigation 2009 IEEE Computer Society Annul Symposium on VLSI Efficient Rerouting Algorithms for Congestion Mitigtion M. A. R. Chudhry*, Z. Asd, A. Sprintson, nd J. Hu Deprtment of Electricl nd Computer Engineering

More information

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence Solving Prolems y Serching CS 486/686: Introduction to Artificil Intelligence 1 Introduction Serch ws one of the first topics studied in AI - Newell nd Simon (1961) Generl Prolem Solver Centrl component

More information

9 4. CISC - Curriculum & Instruction Steering Committee. California County Superintendents Educational Services Association

9 4. CISC - Curriculum & Instruction Steering Committee. California County Superintendents Educational Services Association 9. CISC - Curriculum & Instruction Steering Committee The Winning EQUATION A HIGH QUALITY MATHEMATICS PROFESSIONAL DEVELOPMENT PROGRAM FOR TEACHERS IN GRADES THROUGH ALGEBRA II STRAND: NUMBER SENSE: Rtionl

More information

MATH 25 CLASS 5 NOTES, SEP

MATH 25 CLASS 5 NOTES, SEP MATH 25 CLASS 5 NOTES, SEP 30 2011 Contents 1. A brief diversion: reltively prime numbers 1 2. Lest common multiples 3 3. Finding ll solutions to x + by = c 4 Quick links to definitions/theorems Euclid

More information

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. Example: Pancake Problem. Example: Pancake Problem

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. Example: Pancake Problem. Example: Pancake Problem Announcements Project : erch It s live! Due 9/. trt erly nd sk questions. It s longer thn most! Need prtner? Come up fter clss or try Pizz ections: cn go to ny, ut hve priority in your own C 88: Artificil

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

Topic 2: Lexing and Flexing

Topic 2: Lexing and Flexing Topic 2: Lexing nd Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennrt Beringer 1 2 The Compiler Lexicl Anlysis Gol: rek strem of ASCII chrcters (source/input) into sequence of

More information

Lexical Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay

Lexical Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay Lexicl Anlysis Amith Snyl (www.cse.iit.c.in/ s) Deprtment of Computer Science nd Engineering, Indin Institute of Technology, Bomy Septemer 27 College of Engineering, Pune Lexicl Anlysis: 2/6 Recp The input

More information

TO REGULAR EXPRESSIONS

TO REGULAR EXPRESSIONS Suject :- Computer Science Course Nme :- Theory Of Computtion DA TO REGULAR EXPRESSIONS Report Sumitted y:- Ajy Singh Meen 07000505 jysmeen@cse.iit.c.in BASIC DEINITIONS DA:- A finite stte mchine where

More information

ASTs, Regex, Parsing, and Pretty Printing

ASTs, Regex, Parsing, and Pretty Printing ASTs, Regex, Prsing, nd Pretty Printing CS 2112 Fll 2016 1 Algeric Expressions To strt, consider integer rithmetic. Suppose we hve the following 1. The lphet we will use is the digits {0, 1, 2, 3, 4, 5,

More information

CS481: Bioinformatics Algorithms

CS481: Bioinformatics Algorithms CS481: Bioinformtics Algorithms Cn Alkn EA509 clkn@cs.ilkent.edu.tr http://www.cs.ilkent.edu.tr/~clkn/teching/cs481/ EXACT STRING MATCHING Fingerprint ide Assume: We cn compute fingerprint f(p) of P in

More information

CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona

CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona CSc 453 Compilers nd Systems Softwre 4 : Lexicl Anlysis II Deprtment of Computer Science University of Arizon collerg@gmil.com Copyright c 2009 Christin Collerg Implementing Automt NFAs nd DFAs cn e hrd-coded

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

documents 1. Introduction

documents 1. Introduction www.ijcsi.org 4 Efficient structurl similrity computtion etween XML documents Ali Aïtelhdj Computer Science Deprtment, Fculty of Electricl Engineering nd Computer Science Mouloud Mmmeri University of Tizi-Ouzou

More information

Reducing a DFA to a Minimal DFA

Reducing a DFA to a Minimal DFA Lexicl Anlysis - Prt 4 Reducing DFA to Miniml DFA Input: DFA IN Assume DFA IN never gets stuck (dd ded stte if necessry) Output: DFA MIN An equivlent DFA with the minimum numer of sttes. Hrry H. Porter,

More information

Intermediate Information Structures

Intermediate Information Structures CPSC 335 Intermedite Informtion Structures LECTURE 13 Suffix Trees Jon Rokne Computer Science University of Clgry Cnd Modified from CMSC 423 - Todd Trengen UMD upd Preprocessing Strings We will look t

More information

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 4: Lexical Analyzers 28 Jan 08

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 4: Lexical Analyzers 28 Jan 08 CS412/413 Introduction to Compilers Tim Teitelum Lecture 4: Lexicl Anlyzers 28 Jn 08 Outline DFA stte minimiztion Lexicl nlyzers Automting lexicl nlysis Jlex lexicl nlyzer genertor CS 412/413 Spring 2008

More information

6.3 Volumes. Just as area is always positive, so is volume and our attitudes towards finding it.

6.3 Volumes. Just as area is always positive, so is volume and our attitudes towards finding it. 6.3 Volumes Just s re is lwys positive, so is volume nd our ttitudes towrds finding it. Let s review how to find the volume of regulr geometric prism, tht is, 3-dimensionl oject with two regulr fces seprted

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd business. Introducing technology

More information

CMPSC 470: Compiler Construction

CMPSC 470: Compiler Construction CMPSC 47: Compiler Construction Plese complete the following: Midterm (Type A) Nme Instruction: Mke sure you hve ll pges including this cover nd lnk pge t the end. Answer ech question in the spce provided.

More information

George Boole. IT 3123 Hardware and Software Concepts. Switching Algebra. Boolean Functions. Boolean Functions. Truth Tables

George Boole. IT 3123 Hardware and Software Concepts. Switching Algebra. Boolean Functions. Boolean Functions. Truth Tables George Boole IT 3123 Hrdwre nd Softwre Concepts My 28 Digitl Logic The Little Mn Computer 1815 1864 British mthemticin nd philosopher Mny contriutions to mthemtics. Boolen lger: n lger over finite sets

More information

The dictionary model allows several consecutive symbols, called phrases

The dictionary model allows several consecutive symbols, called phrases A dptive Huffmn nd rithmetic methods re universl in the sense tht the encoder cn dpt to the sttistics of the source. But, dpttion is computtionlly expensive, prticulrly when k-th order Mrkov pproximtion

More information

Spectral Analysis of MCDF Operations in Image Processing

Spectral Analysis of MCDF Operations in Image Processing Spectrl Anlysis of MCDF Opertions in Imge Processing ZHIQIANG MA 1,2 WANWU GUO 3 1 School of Computer Science, Northest Norml University Chngchun, Jilin, Chin 2 Deprtment of Computer Science, JilinUniversity

More information

Languages. L((a (b)(c))*) = { ε,a,bc,aa,abc,bca,... } εw = wε = w. εabba = abbaε = abba. (a (b)(c)) *

Languages. L((a (b)(c))*) = { ε,a,bc,aa,abc,bca,... } εw = wε = w. εabba = abbaε = abba. (a (b)(c)) * Pln for Tody nd Beginning Next week Interpreter nd Compiler Structure, or Softwre Architecture Overview of Progrmming Assignments The MeggyJv compiler we will e uilding. Regulr Expressions Finite Stte

More information

Parallel Square and Cube Computations

Parallel Square and Cube Computations Prllel Squre nd Cube Computtions Albert A. Liddicot nd Michel J. Flynn Computer Systems Lbortory, Deprtment of Electricl Engineering Stnford University Gtes Building 5 Serr Mll, Stnford, CA 945, USA liddicot@stnford.edu

More information

PARALLEL AND DISTRIBUTED COMPUTING

PARALLEL AND DISTRIBUTED COMPUTING PARALLEL AND DISTRIBUTED COMPUTING 2009/2010 1 st Semester Teste Jnury 9, 2010 Durtion: 2h00 - No extr mteril llowed. This includes notes, scrtch pper, clcultor, etc. - Give your nswers in the ville spce

More information

Lecture T4: Pattern Matching

Lecture T4: Pattern Matching Introduction to Theoreticl CS Lecture T4: Pttern Mtching Two fundmentl questions. Wht cn computer do? How fst cn it do it? Generl pproch. Don t tlk bout specific mchines or problems. Consider miniml bstrct

More information

12-B FRACTIONS AND DECIMALS

12-B FRACTIONS AND DECIMALS -B Frctions nd Decimls. () If ll four integers were negtive, their product would be positive, nd so could not equl one of them. If ll four integers were positive, their product would be much greter thn

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

Ma/CS 6b Class 1: Graph Recap

Ma/CS 6b Class 1: Graph Recap M/CS 6 Clss 1: Grph Recp By Adm Sheffer Course Detils Instructor: Adm Sheffer. TA: Cosmin Pohot. 1pm Mondys, Wednesdys, nd Fridys. http://mth.cltech.edu/~2015-16/2term/m006/ Min ook: Introduction to Grph

More information

Registering as a HPE Reseller. Quick Reference Guide for new Partners in Asia Pacific

Registering as a HPE Reseller. Quick Reference Guide for new Partners in Asia Pacific Registering s HPE Reseller Quick Reference Guide for new Prtners in Asi Pcific Registering s new Reseller prtner There re five min steps to e new Reseller prtner. Crete your Appliction Copyright 2017 Hewlett

More information

Position Heaps: A Simple and Dynamic Text Indexing Data Structure

Position Heaps: A Simple and Dynamic Text Indexing Data Structure Position Heps: A Simple nd Dynmic Text Indexing Dt Structure Andrzej Ehrenfeucht, Ross M. McConnell, Niss Osheim, Sung-Whn Woo Dept. of Computer Science, 40 UCB, University of Colordo t Boulder, Boulder,

More information

Implementing Automata. CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona

Implementing Automata. CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona Implementing utomt Sc 5 ompilers nd Systems Softwre : Lexicl nlysis II Deprtment of omputer Science University of rizon collerg@gmil.com opyright c 009 hristin ollerg NFs nd DFs cn e hrd-coded using this

More information

Small Business Networking

Small Business Networking Why network is n essentil productivity tool for ny smll business Effective technology is essentil for smll businesses looking to increse the productivity of their people nd processes. Introducing technology

More information

An Algorithm for Enumerating All Maximal Tree Patterns Without Duplication Using Succinct Data Structure

An Algorithm for Enumerating All Maximal Tree Patterns Without Duplication Using Succinct Data Structure , Mrch 12-14, 2014, Hong Kong An Algorithm for Enumerting All Mximl Tree Ptterns Without Dupliction Using Succinct Dt Structure Yuko ITOKAWA, Tomoyuki UCHIDA nd Motoki SANO Astrct In order to extrct structured

More information

ZZ - Advanced Math Review 2017

ZZ - Advanced Math Review 2017 ZZ - Advnced Mth Review Mtrix Multipliction Given! nd! find the sum of the elements of the product BA First, rewrite the mtrices in the correct order to multiply The product is BA hs order x since B is

More information

CS311H: Discrete Mathematics. Graph Theory IV. A Non-planar Graph. Regions of a Planar Graph. Euler s Formula. Instructor: Işıl Dillig

CS311H: Discrete Mathematics. Graph Theory IV. A Non-planar Graph. Regions of a Planar Graph. Euler s Formula. Instructor: Işıl Dillig CS311H: Discrete Mthemtics Grph Theory IV Instructor: Işıl Dillig Instructor: Işıl Dillig, CS311H: Discrete Mthemtics Grph Theory IV 1/25 A Non-plnr Grph Regions of Plnr Grph The plnr representtion of

More information

10.5 Graphing Quadratic Functions

10.5 Graphing Quadratic Functions 0.5 Grphing Qudrtic Functions Now tht we cn solve qudrtic equtions, we wnt to lern how to grph the function ssocited with the qudrtic eqution. We cll this the qudrtic function. Grphs of Qudrtic Functions

More information

On Achieving Optimal Throughput with Network Coding

On Achieving Optimal Throughput with Network Coding In IEEE INFOCOM On Achieving Optiml Throughput with Network Coding Zongpeng Li, Bochun Li, Dn Jing, Lp Chi Lu Astrct With the constrints of network topologies nd link cpcities, chieving the optiml end-to-end

More information

Midterm 2 Sample solution

Midterm 2 Sample solution Nme: Instructions Midterm 2 Smple solution CMSC 430 Introduction to Compilers Fll 2012 November 28, 2012 This exm contins 9 pges, including this one. Mke sure you hve ll the pges. Write your nme on the

More information

Text mining: bag of words representation and beyond it

Text mining: bag of words representation and beyond it Text mining: bg of words representtion nd beyond it Jsmink Dobš Fculty of Orgniztion nd Informtics University of Zgreb 1 Outline Definition of text mining Vector spce model or Bg of words representtion

More information

The Math Learning Center PO Box 12929, Salem, Oregon Math Learning Center

The Math Learning Center PO Box 12929, Salem, Oregon Math Learning Center Resource Overview Quntile Mesure: Skill or Concept: 80Q Multiply two frctions or frction nd whole numer. (QT N ) Excerpted from: The Mth Lerning Center PO Box 99, Slem, Oregon 9709 099 www.mthlerningcenter.org

More information

COMBINATORIAL PATTERN MATCHING

COMBINATORIAL PATTERN MATCHING COMBINATORIAL PATTERN MATCHING Genomic Repets Exmple of repets: ATGGTCTAGGTCCTAGTGGTC Motivtion to find them: Genomic rerrngements re often ssocited with repets Trce evolutionry secrets Mny tumors re chrcterized

More information

Engineer To Engineer Note

Engineer To Engineer Note Engineer To Engineer Note EE-186 Technicl Notes on using Anlog Devices' DSP components nd development tools Contct our technicl support by phone: (800) ANALOG-D or e-mil: dsp.support@nlog.com Or visit

More information

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών ΕΠΛ323 - Θωρία και Πρακτική Μταγλωττιστών Lecture 3 Lexicl Anlysis Elis Athnsopoulos elisthn@cs.ucy.c.cy Recognition of Tokens if expressions nd reltionl opertors if è if then è then else è else relop

More information

UT1553B BCRT True Dual-port Memory Interface

UT1553B BCRT True Dual-port Memory Interface UTMC APPICATION NOTE UT553B BCRT True Dul-port Memory Interfce INTRODUCTION The UTMC UT553B BCRT is monolithic CMOS integrted circuit tht provides comprehensive MI-STD- 553B Bus Controller nd Remote Terminl

More information