Contrary to Popular Belief Incremental Discretization can be Sound, Computationally Efficient and Extremely Useful for Streaming Data

Geoffrey I. Webb
Faculty of Information Technology, Monash University, Victoria, Australia

Abstract

Discretization of streaming data has received surprisingly little attention. This might be because streaming data require incremental discretization with cut points that may vary over time and this is perceived as undesirable. We argue, to the contrary, that it can be desirable for a discretization to evolve in synchronization with an evolving data stream, even when the learner assumes that attribute values' meanings remain invariant over time. We examine the issues associated with discretization in the context of distribution drift and develop computationally efficient incremental discretization algorithms. We show that discretization can reduce the error of a classical incremental learner and that allowing a discretization to drift in synchronization with distribution drift can further reduce error.

I. INTRODUCTION

It is surprising that discretization of numeric data streams has received little attention. One reason may be that the cut points are likely to have to change as the stream progresses, because the distribution of values may vary. This may bias potential users against using discretization because it may seem unintuitive to use discretized values whose meaning changes over time. We argue, to the contrary, that changing over time the cut points associated with each discretized value might sometimes be necessary if the interval is to retain the relevant meaning for a given task.

This paper investigates discretization of numeric stream data. We present two efficient and effective incremental discretization algorithms. The first approximates equal frequency discretization over the entire stream up to the current time step. The second uses a window of recent values and performs equal frequency discretization on these, allowing the cut points to exactly track a non-stationary distribution.
Our experiments demonstrate that discretization can reduce error for the state-of-the-art streaming learner Logistic Regression (LR) with Stochastic Gradient Descent. We further demonstrate that for some streaming data it is indeed useful to have discretizations whose cut points change over time, tracking the evolution of the underlying concepts.

II. DISCRETIZATION FOR STREAMING DATA

We wish to incrementally update a model $\Theta$ to predict the posterior probability distribution $P(y \mid x)$ of the classes $y \in \{c_1, \ldots, c_k\}$ for objects $x = \langle x_1, \ldots, x_a \rangle$ while viewing a large or infinite stream $S = \{x^1, \ldots, x^n\}$ of objects. We use $\Theta^i$ to denote the model at time step $i$ and $P_{\Theta^i}(y \mid x)$ to denote the class distribution predicted by model $\Theta^i$ for object $x$. We assume that the true class $y^i$ for each $x^i$ becomes available after $x^i$ is classified and can be used for subsequent training of the classifier. The attribute values $x^i_j$ of the objects may be either categorical or numeric.

A discretization $\delta$ of a numeric attribute $X$ is a set of $m$ intervals called bins. These bins can be defined by cut points $\{\kappa_1, \ldots, \kappa_{m-1}\}$. These cut points divide the domain of $X$ into bins $b_1, \ldots, b_m$ using a scheme such as $b_1 = [-\infty, \kappa_1]$, $b_m = (\kappa_{m-1}, \infty]$ and, for $1 < i < m$, $b_i = (\kappa_{i-1}, \kappa_i]$. A discretization of attribute $X$ defines a mapping between values $v$ of $X$ and bin indexes, $\delta(v) = z$ such that $v \in b_z$.

Discretization is closely related to both histograms and quantiles. A histogram of a numeric attribute $X$ with respect to a dataset $S$ can be viewed as a discretization of $X$ augmented with a vector of counts $\eta_1, \ldots, \eta_m$ such that $\eta_k$ represents $|\{j : x^j \in b_k\}|$, the number of records whose value for the attribute falls within the bin. A $p$-th quantile $Q_p$ of an attribute $X$ with respect to $S$ is a value such that $|\{j : x^j < Q_p\}|/n \le p \wedge |\{j : x^j > Q_p\}|/n \le 1 - p$. That is, it is the value of $x^{pn}$ if the data were sorted on the attribute. If $pn$ is not an integer then the $p$-th quantile may be any value in $[x^{\lfloor pn \rfloor}, x^{\lceil pn \rceil}]$ and is often set to $x^{\lfloor pn \rfloor} + (x^{\lceil pn \rceil} - x^{\lfloor pn \rfloor})/2$.

III. ISSUES IN DISCRETIZATION FOR STREAMING DATA

The cut points for discretization of streaming data may need to change over time. This is because the process that generates the stream $S$ may be non-stationary, in which case it is not going to be possible to anticipate what the future distribution of values for an attribute will be and hence impossible to predetermine what intervals will be relevant in the future. If the intervals are predetermined and remain static then they are likely to eventually lose relevance.

However, such changes to the intervals over time may appear undesirable, as they seem to imply that the meanings of the bins must change. We suspect that this has been a key reason why there has been little previous research into discretization for streaming data.

However, this concern may be misguided. If a distribution is non-stationary then it actually may be desirable for the discretization to drift in synchronization with the changes in the distribution. For example, consider a stream of data that includes an income attribute. The values of this attribute can be expected to grow over time. For at least some applications it seems credible that we should want the discretization to reflect this evolution. For example, it may be necessary for the cut point on a bin representing high income to increase over time if that interval is to retain its relevant meaning.

A further issue is that some algorithms do not require continuity over time in the bins that are used. For example, Naive Bayes [1] requires at classification time estimates of the prior probability of each class, $P(y)$, and of the likelihood of each attribute value given the class, $P(x_j \mid y)$. These can be derived from counts of the relative frequency of each class and of each pair of class and attribute value. It is not relevant what the intervals were for previous classifications, only that these necessary statistics be available for the current discretization. Hence, Naive Bayes can be well served by a technique that maintains a suitable augmented histogram over time and it is irrelevant whether the number of bins or their cut points change. Rather, the key issue is whether the counts are sufficiently accurate for effective classification [2].

On the other hand, most discriminative learning algorithms do not operate in this manner and do require that the number of bins and their meaning be constant over time. For such algorithms, quantile-based discretization, such as equal frequency discretization, may be effective. This unsupervised discretization strategy requires that the number of bins, $m$, be pre-specified, together with a set of quantiles that specify the cut points. For equal frequency discretization the range of attribute $X$ is divided into $m$ bins, each containing the same number of training examples, that is, into bins $b_1, \ldots, b_m$ such that $\forall k, l \in \{1, \ldots, m\}$, $|\{j : x^j \in b_k\}| = |\{j : x^j \in b_l\}|$. This is directly related to the problem of finding quantiles, as the $k$-th bin has an interval $(Q_{(k-1)/m}, Q_{k/m}]$.

Quantile-based discretization allows at least one type of meaning of an interval to remain invariant even while the cut points change. Consider again the case of an attribute for income. Suppose it is discretized into three bins, the lower, middle and upper thirds of income.
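The three-bin income example can be made concrete with a small Python sketch. It only illustrates batch equal frequency discretization under the interval scheme of Section II; the function names `equal_frequency_cut_points` and `bin_index` are hypothetical, not taken from the paper's software, and each cut point is simply read off the sorted data as the largest value that belongs to the lower bin.

```python
from bisect import bisect_left


def equal_frequency_cut_points(values, m):
    """Return m - 1 cut points that split `values` into m roughly equal bins."""
    ordered = sorted(values)
    n = len(ordered)
    if n < m:
        raise ValueError("need at least one value per bin")
    # The k-th cut point approximates the k/m quantile: the largest value
    # that still falls in bin k when the data are sorted.
    return [ordered[(k * n) // m - 1] for k in range(1, m)]


def bin_index(v, cut_points):
    """Map v to a 0-based bin index, with bins of the form (k_{i-1}, k_i]."""
    # bisect_left counts the cut points strictly below v, so a value equal
    # to a cut point stays in the lower bin, matching the closed upper end.
    return bisect_left(cut_points, v)
```

With incomes `[18, 22, 25, 31, 40, 47, 55, 72, 95]` and `m = 3` the cut points are `[25, 47]`. Doubling every income doubles the cut points to `[50, 94]`, yet each record keeps the same bin index, which is the sense in which the lower, middle and upper thirds keep their meaning.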
If a streaming discretization algorithm is able to maintain such a discretization over time, varying the cut points as needed, at least one potentially important meaning of the intervals will remain constant.

Supervised discretization often results in more useful bins than unsupervised approaches [3]. However, supervised discretization does not appear feasible for discriminative learners in a streaming context, as the cuts selected by a supervised approach may vary dramatically over time and classical discriminative learners cannot track and adjust for this. In contrast, quantile-based discretization can maintain a constant set of bins, each with a meaning that remains invariant even while the cut values that define the bins drift. If meaningful quantiles can be identified for a learning problem then these should be used. However, we show that even when such information is not known, simple equal frequency discretization can be effective.

TABLE I. UPDATE SAMPLES

globals
  s: the sample size
  n: the number of instances seen in the stream to date
  V: a vector of sets of samples, indexed by attribute

procedure UPDATESAMPLES(x = <x_1, ..., x_a>)
  if rand() ≤ s/n then
    for i = 1 to a do
      if x_i is not missing then
        if |V_i| = s then
          remove a random element from V_i
        end if
        add x_i to V_i
      end if
    end for
  else
    for i = 1 to a do
      if |V_i| < s and x_i is not missing then
        add x_i to V_i
      end if
    end for
  end if
end procedure

It may appear counter-intuitive that discretization should improve the performance of a learning algorithm that can handle numeric values directly, because it is clear that discretization loses information. However, even though a discretized variable contains less information than the undiscretized original, the models that a learner forms may be able to employ that information more effectively. Consider for example a simple linear model such as created by Logistic Regression. Such a model requires that the predictiveness of a numeric value be proportional to its value.
It cannot directly model the case where only unusually high values are indicative of one class, and average or low values are all equally indicative of the other, or where average values are indicative of one class and either high or low values indicative of the other. By discretizing the attribute and then treating each discrete value as a binary variable a linear classifier can model the predictiveness of individual segments of the number line irrespective of their relative absolute values.

IV. INCREMENTAL DISCRETIZATION ALGORITHM IDA

The Incremental Discretization Algorithm (IDA) approximates quantile-based discretization on the entire data stream encountered to date by maintaining a random sample of the data which is used to calculate the cut points. A random sample is used because:
1) it is not feasible for high-throughput streams to maintain a complete record of all values observed to date;
2) it is computationally efficient; and
3) it is possible to place tight bounds on the expected variance of the cut points [4].

We use the reservoir sampling algorithm [5] to maintain the random sample of $s$ values $V_i$ for each attribute. The first $s$ values of each $X_i$ are added to the corresponding $V_i$. Thereafter, when the $n$-th object $\langle x^n, y^n \rangle$ is encountered, with probability $s/n$, each of its values $x^n_i$ replaces a randomly selected value of the corresponding $V_i$. See Table I.

We store the values of each attribute in a vector of interval heaps [6], where $V_i^j$ stores the values for the $j$-th bin of $X_i$. This provides efficient access to the minimum and maximum values in a bin, and direct access to a random value within a bin when replacing a value selected at random. This data structure ensures that insertion and deletion are of order $O(\log s)$ and retrieving a cut point is constant time.

The algorithm for inserting a value $v$ into $V_i$ is presented in Table II. Recall that $m$ is the number of bins. $\overline{V_i^j}$ and $\underline{V_i^j}$ denote, respectively, the maximum and minimum value in $V_i^j$. Line 2 finds the target bin, the next bin that should increase in size. Line 3 uses binary search to find the bin in which the value belongs. The value is inserted into the appropriate bin. If it is not the target bin, the excess value is shuffled up or down to the target. Deletion is a minor variation on insertion. The cut points are accessed in constant time by returning the maximum value of the appropriate bin.

TABLE II. INSERT VALUE

globals
  m: the number of bins

 1: procedure INSERTVALUE($v$, $V_i$)
 2:   $t = |V_i| \bmod m$
 3:   $j = \arg\min_j \overline{V_i^j} \ge v$
 4:   insert $v$ into $V_i^j$
 5:   if $j < t$ then
 6:     for $k = j$ to $t - 1$ do
 7:       add $\overline{V_i^k}$ to $V_i^{k+1}$
 8:       remove $\overline{V_i^k}$ from $V_i^k$
 9:     end for
10:   else
11:     for $k = t$ to $j - 1$ do
12:       add $\underline{V_i^{k+1}}$ to $V_i^k$
13:       remove $\underline{V_i^{k+1}}$ from $V_i^{k+1}$
14:     end for
15:   end if
16: end procedure

IDA maintains a random sample of the stream from its beginning to the current point of time. As suggested in the introduction, in some contexts it might be valuable to have the intervals drift, so that the actual cut point associated with the lowest range of income, for example, drifts upwards as inflation increases incomes. IDA's intervals will drift over time to reflect overall changes in the total distribution to date. However, it does not directly track the current distribution.

A variant that more precisely tracks the evolution of a data stream is to maintain the sample as a window of the $s$ most recent objects. In this case the discretization will change as the distribution changes, but will be more subject to frequent random minor fluctuations than a more gradual update approach. We call the latter approach the Incremental Discretization Algorithm with a Window (IDAW). This requires the additional overhead of maintaining for each attribute the window of values in time order so that the oldest value can at each step be identified and replaced by the newest value.

A. Computational Complexity

The computational complexity of IDA is dominated by the costs of maintaining the samples and determining the quantiles from those samples.
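To make the two sample-maintenance policies concrete, the following Python sketch contrasts a reservoir sample in the spirit of Table I with a sliding window as used by IDAW. It is a simplified illustration rather than the paper's implementation: the class and function names are hypothetical, a single attribute is handled at a time, and a plain sorted list stands in for the interval heaps and balanced trees, so each update costs O(s) here instead of the logarithmic costs analysed in this section.

```python
import random
from bisect import bisect_left, insort
from collections import deque


class ReservoirSample:
    """IDA-style sample: drawn uniformly from the whole stream seen so far."""

    def __init__(self, s, rng=None):
        self.s = s                       # sample size
        self.n = 0                       # number of values seen so far
        self.values = []                 # the sample, kept in sorted order
        self.rng = rng or random.Random()

    def update(self, v):
        self.n += 1
        if len(self.values) < self.s:
            insort(self.values, v)       # still filling the sample
        elif self.rng.random() <= self.s / self.n:
            # Replace a uniformly chosen sampled value with the new one.
            del self.values[self.rng.randrange(self.s)]
            insort(self.values, v)


class WindowSample:
    """IDAW-style sample: exactly the s most recent values."""

    def __init__(self, s):
        self.s = s
        self.by_time = deque()           # arrival order, oldest first
        self.values = []                 # the same values in sorted order

    def update(self, v):
        if len(self.by_time) == self.s:
            oldest = self.by_time.popleft()
            del self.values[bisect_left(self.values, oldest)]
        self.by_time.append(v)
        insort(self.values, v)


def cut_points(sample, m):
    """Equal frequency cut points read off a sorted sample."""
    n = len(sample.values)
    if n < m:
        raise ValueError("need at least one sampled value per bin")
    return [sample.values[(k * n) // m - 1] for k in range(1, m)]
```

The reservoir changes only occasionally once the stream is long, whereas the window changes on every update; that difference is what drives the contrast in cost between IDA and IDAW.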
The required operations are to insert a new value (only required while the sample is not yet at full size), to replace a random value with a new value, and to return the required quantiles. As each bin is maintained as an interval heap [6], finding the quantiles takes constant time and inserting or removing a value from a bin $V_i^j$ takes $O(\log |V_i^j|) = O(\log(s/m))$ time. As replacement requires up to $m$ insertions and deletions, replacement requires order $O(m \log(s/m))$ time. However, these relatively expensive updates are only required with probability $s/t$, that is, on average once every $t/s$ time steps, where $t$ is the current time step or size of the stream to date. Thus the amortized cost is

$$O\left(\left[\sum_{i=1}^{s} m \log(i/m) + \sum_{i=s+1}^{t} \frac{s}{i}\, m \log(s/m)\right] / t\right),$$

where the first term represents the initial $s$ time steps during which the sample is built up to its operating size and the second term represents updates to the sample once it reaches operating size. It is readily apparent that these updates rapidly become very rare and that as the size of the stream becomes very large the amortized cost becomes negligibly small.

The situation is more complex for IDAW, which maintains a window of the $s$ most recent values for each attribute. This requires that the values be maintained in both time and value order. Maintaining an order by time can be achieved very efficiently with a circular buffer, which supports all updates and accesses in constant time. As the elements to be replaced in a replacement operation are no longer selected at random, it is not efficient to maintain the bins as interval heaps, as above. Rather we need to use slightly more expensive balanced binary trees, for which the time to identify the location of the value to be removed is $O(\log(s/m))$; this does not increase the overall complexity of the update operation relative to that for IDA. The major computational penalty, however, is that these updates must be performed for every object encountered in the stream, which makes the maintenance of the discretization a non-trivial ongoing overhead.

V. RELATED RESEARCH

As we have noted above, maintaining the $i/m$-quantiles for each $1 \le i \le m$ is the key requirement in order to discretize a data stream into $m$ equal frequency bins. These quantiles provide the required cut points. Algorithms exist for finding approximate quantiles in data streams with strict bounds on the error [7], [8]. However, they rely on the records in the stream appearing in random order, a requirement that is likely to be strongly violated in many learning applications. This renders these algorithms inappropriate for our purposes.

A discretization technique should be matched to the properties of the learning algorithm. A number of papers have investigated discretization of streaming data in the context of naive Bayes (NB) [9]-[11]. NB is an unusual algorithm in that the model it learns for categorical data can be represented in the form of an augmented histogram, requiring counts of both the frequency of each attribute value and the joint frequency of each combination of an attribute and a class value. As a consequence it does not matter if there is a change in either the number of values of an attribute or the meaning of an attribute value, so long as the appropriate counts are maintained. In contrast, many other incremental learning algorithms, such as linear classifiers with weights learned by stochastic gradient descent, require that the number of attribute values remains constant and that their meanings do not change. In the current work we target algorithms that require the number of bins and their meanings to be invariant.

Partition Incremental Discretization (PID) [10] allows the number of intervals to remain constant. It operates by forming two layers of discretization. The top layer is the discretization used by the learning algorithm. The bottom layer contains many more bins than the top layer. In their example case for equal frequency discretization the bottom layer aims to maintain bins that contain 1/20 the number of instances required by each bin at the top level. Top-level bins are formed by aggregation of consecutive lower-level bins until approximately the correct size bin is obtained.

The lower-level bins are initially formed by setting cut points at equal distances along the number line between an indicative lower and upper value on the attribute. Then as the stream is consumed, the counts for the lower-level bins are incremented as appropriate. When a lower-level bin exceeds a threshold size it is split on a value mid-way between its minimum and maximum values, and each count is set to one half the count for the original bin. This may result in some inaccuracy in the counts, but such inaccuracy only matters when the two parts of a split lower-level interval end up in different top-level bins, as otherwise both of the bins that have been formed will fall within the one top-level bin and the top-level bin's total count will remain accurate. The paper does not specify the threshold for splitting. In our study we use twice the target size for a lower-level bin. In other words, a lower-level bin is split in two when it exceeds 1/10th the target size for an upper-level bin.

PID has three potential limitations. First, as lower-level bins move from one higher-level bin to another, there might be abrupt changes in the cut points from one update to the next. Second, if the spread of values on the number line is not uniform, the number of bins created may become very large. This is because a small number of initial bins may need to be repeatedly split to accommodate the majority of the data. Third, the splitting process might result in major inaccuracies in the estimated counts when there are very large numbers of repetitions of a single value $v$. In this case the lower-level bin into which $v$ falls will rapidly grow to exceed the size threshold and be split.
However, the division of the counts across the two resulting bins will be inaccurate, as all the repetitions of $v$ belong in the same bin but will be attributed equally to each of the new bins. This may occur repeatedly, causing a diminishingly small proportion of the true count for $v$ to be allocated to the correct bin.

VI. EVALUATION

We seek to evaluate three primary contributions:
1) IDA, a new algorithm for efficient and effective discretization of streaming data that approximates the maintenance of equal frequency discretization over all of the data observed up to the current time;
2) IDAW, a variant of IDA that seeks to maintain an equal frequency discretization over the data distribution at the current time; and
3) the hypothesis that discretization based on quantiles can allow the cut points to drift over time without changing the relevant meaning of the intervals.

It is also important to understand exactly how much power is lost by performing incremental rather than batch discretization. To assess these contributions and issues we evaluate each component of our new algorithms in turn.

We first compare discretization using the full data (Pre-Disc) against discretization using all the data encountered up to the time of classification (All-So-Far). Note that both Pre-Disc and All-So-Far set hypothetical benchmarks. Neither is feasible in a real-world streaming data context because the first requires seeing all data that will ever come through the stream in advance and the second requires retaining and analyzing all data in the stream.

The next relevant test is to assess the loss in accuracy that results from using a random sample rather than discretizing on all the data encountered to date. To this end we compare IDA against All-So-Far.

It is also important to compare against the current state-of-the-art in incremental discretization, PID. This is the only prior incremental discretization technique capable of supporting equal frequency discretization. PID requires that the user provides an initial estimate of the likely minimum and maximum value for each attribute.
To ensure that the evaluation is as favorable to PID as possible we use the true minimum and maximum in place of these estimates, values that are often not known in practice for real streaming data.

In order to understand what advantage, if any, discretization can confer, we compare LR with IDA to LR performed on normalized numeric data (No-Disc). To ensure that this comparison is as favorable as possible to the non-discretization option we normalize using the minimum and maximum values of each attribute in the data, replacing each value $x$ with

$$2(x - \underline{X})/(\overline{X} - \underline{X}) - 1.0, \qquad (1)$$

where $\underline{X}$ and $\overline{X}$ denote respectively the minimum and maximum values for the attribute $X$. This normalizes values to the interval $[-1.0, 1.0]$. Such normalization would clearly often not be possible in practice with streaming data because it is often not possible to know in advance the minimum and maximum values for an attribute.

We also wish to investigate the idea of allowing the discretization to drift over time, closely tracking the current distribution of values. To this end we compare IDAW to IDA using sample sizes of [...].

We perform all experiments using LR with single-pass Stochastic Gradient Descent (LRSGD) using regularization rate $\mu =$ [...] and learning rate or step size $\lambda =$ [...]. These are rates that we have found to be effective in previous experimental work on the current datasets when not using discretization. The regularization rate is not reduced over time as we are seeking to learn in the presence of distribution and concept drift and hence the target is non-stationary and so we cannot assume that we are ever approaching a fixed optimum. LRSGD has been selected as an exemplar of the type of incremental learning algorithm normally associated with numeric data that we believe may benefit from discretization.

All experiments use the procedure outlined in Table III. We use 5 bin discretization because 10 bins obtained the same overall results and 5 bins provided the best results for Pre-Disc.

A. Comparisons without distribution or concept drift

We are presenting a new approach to discretization.
While it is designed for use with streaming data it is important to establish how much accuracy is lost relative to the non-streaming baseline in a situation where there is clearly no distribution or concept drift. To this end we perform experiments where the data are shuffled to ensure there is no systematic drift over time and compare our streaming algorithms against pre-discretization using all the data and a streaming discretization that uses all the data up to the current point of time. 20 experiments were performed for each data stream, each time shuffling the data in advance.

TABLE III. STREAM LEARNING PROCEDURE

procedure STREAMTEST(data stream: $S$, discretizer: $\Delta$, learner: $\lambda$)
  initialize the discretization $\delta$ as required by $\Delta$
  for $i = 1$ to $|S|$ do
    update $\delta$ by applying $\Delta(\delta, x^i)$
    apply the learner, $\hat{y} = \lambda(\delta(x^i))$
    record the error $I(\hat{y} \ne y^i)$
  end for
end procedure

We use the only public real-world stream classification datasets of which we are aware: airlines and electricity, obtained from the MOA website [12]; gas-sensor, obtained from the UCI Repository [13]; and power-supply and sensor, obtained from the Stream Data Mining Repository [14].

We present the resulting mean 0-1 loss for each algorithm on each dataset in Fig. 1, with error bars representing 1 standard deviation marked, but too close to be readily discerned. We use two-tailed matched-pair t-tests for significance, employing an adjusted critical value of 0.05/75 ≈ 0.00067 after a Bonferroni correction for the 75 comparisons performed (15 pairs of algorithms times 5 datasets).

Fig. 1. 0-1 loss on data without distribution drift, for Pre-Disc, All-So-Far, IDA, PID, IDAW and No-Disc on the sensor, power-supply, airlines, electricity and gas-sensor data streams.

In no case is there a significant difference between the error of the discretization techniques on airlines, electricity, gas-sensor or power-supply (p = [...] to [...]). On sensor, IDAW has significantly higher error than the other discretization techniques (p = [...] to [...]). This may be due to the instability of the quantiles as they are continually updated. On all streams the use of LR without discretization results in higher error than its use with any discretization technique (p = [...] to [...]).

These results demonstrate that our computationally efficient use of small samples provides performance that is close to optimal in the absence of concept drift.

B. Comparisons on real world data

To establish the value of our algorithms in the context of distribution drift, it is useful to assess performance on real-world stream data. Our final study compares the algorithms on the real-world data used in the previous experiment.

We process our real-world datasets in their original order. Because they are in a fixed order it is not possible to have repeated trials and hence not possible to perform statistical tests. In consequence, one should be cautious in interpreting the apparent differences as meaningful unless they are quite substantial.

The 0-1 loss is presented in Figure 2. All the discretization techniques appear to enjoy a substantial advantage relative to no discretization on all data streams other than airlines, for which the advantage is small.

Fig. 2. Errors for each approach (Pre-Disc, All-So-Far, IDA, PID, IDAW, No-Disc) on each data stream (power-supply, sensor, airlines, electricity, gas-sensor).

Maintaining an exact discretization over all the data to the current point offers similar accuracy to pre-discretization on all datasets except sensor, for which it appears to substantially increase error. It is not apparent why All-So-Far should be penalized on this particular data stream.

The two approaches that seek to approximate All-So-Far, IDA and PID, both achieve error very close to its error.

The IDAW approach of tracking the current distribution delivers very substantial reductions in error for the electricity and sensor data streams, but results in substantial increases in error for gas-sensor and power-supply. The benefit of this approach on the electricity and sensor data streams supports our hypothesis that maintaining discretizations based on quantiles as they vary over time can maintain meaning while the cut points vary. However, the results for the other data streams show that some types of distribution drift do not take this form.

C. Running times

Due to the large number of repetitions of processing large datasets we conducted all experiments on a heterogeneous grid system. As a result, compute times are only indicative at best.
Nonetheless we present in Figure 3 the compute times for the experiments on real-world data in order to give a feel for the computational profiles of the techniques that we have developed. The software is implemented in C++ but little attempt has been made to optimize the discretization process. The key observation is that IDA and its variants in most cases incur only modest computational overheads relative to no discretization. The relatively poor performance of PID should be treated with caution as we have made no attempt to optimize our reimplementation of the technique.
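The evaluation loop of Table III can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the C++ software used in the experiments: it handles one numeric attribute and two classes, the class names `WindowDiscretizer` and `OneHotLogisticSGD` are hypothetical, the window discretizer is a simplified stand-in for IDAW, and the `rate` and `reg` defaults are arbitrary placeholders rather than the rates used in the paper.

```python
import math
from bisect import bisect_left, insort
from collections import deque


class WindowDiscretizer:
    """Equal frequency bins over the s most recent values (IDAW-like)."""

    def __init__(self, m, s):
        self.m, self.s = m, s
        self.by_time, self.values = deque(), []

    def update(self, v):
        if len(self.by_time) == self.s:
            old = self.by_time.popleft()
            del self.values[bisect_left(self.values, old)]
        self.by_time.append(v)
        insort(self.values, v)

    def bin_of(self, v):
        n = len(self.values)
        # max(..., 0) guards the first few steps, when n < m.
        cuts = [self.values[max((k * n) // self.m - 1, 0)]
                for k in range(1, self.m)]
        return bisect_left(cuts, v)


class OneHotLogisticSGD:
    """Binary logistic regression with one indicator weight per bin."""

    def __init__(self, m, rate=0.1, reg=0.001):  # placeholder rates
        self.w = [0.0] * m
        self.bias = 0.0
        self.rate, self.reg = rate, reg

    def prob(self, b):
        return 1.0 / (1.0 + math.exp(-(self.w[b] + self.bias)))

    def train(self, b, y):
        g = self.prob(b) - y             # gradient of the log loss
        self.w[b] -= self.rate * (g + self.reg * self.w[b])
        self.bias -= self.rate * g


def stream_test(stream, discretizer, learner):
    """Update the bins, predict, record the 0-1 error, then train."""
    errors = 0
    for v, y in stream:
        discretizer.update(v)
        b = discretizer.bin_of(v)
        y_hat = 1 if learner.prob(b) >= 0.5 else 0
        errors += int(y_hat != y)
        learner.train(b, y)              # the label arrives after prediction
    return errors / len(stream)
```

As a usage example, a stream such as `[(100 * (i % 3) + i // 10, 1 if i % 3 == 1 else 0) for i in range(3000)]` makes the middle third of the current values indicate class 1 while all values drift upwards. A single weight on the raw value cannot express that pattern, whereas the indicator weights over drifting bins can.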

Fig. 3. Running times for each incremental discretization technique (IDA, IDAW, PID) on each data stream (airlines, electricity, gas-sensor, power-supply, sensor), presented in multiples of time taken without discretization.

VII. CONCLUSIONS

We have explored the key issues that surround discretization of streaming data and presented two new techniques based on sampling. Most discriminative algorithms require that the number and meaning of the bins remain invariant. We argue that binning on fixed quantiles of the distribution, rather than fixed absolute values, can maintain an appropriate meaning over streaming data with distribution drift. Hence one bin can represent the top p values for the data and so on, even as the absolute values in that range vary.

Our new stream discretization techniques use a sample of values for an attribute to maintain an equal frequency discretization. They differ only in the composition of the sample. IDA uses the reservoir sampling algorithm to maintain a sample drawn uniformly at random from the entire stream up until the current time. This approximates the maintenance of an equal frequency discretization over the entire stream up to the current point. Its desirable features include involving negligible computation once the stream becomes large, as updates to the sample become very rare. Our results show that it is very effective in the absence of concept drift and can substantially reduce the error of LRSGD. Even with a very small sample it only increases the error very modestly compared with equal frequency discretization over all data in the stream to date.

IDAW is a variant of IDA that is useful when it is desirable to more closely track the current distribution of the data. IDAW maintains a window of the most recent values for an attribute and discretizes these. This approach incurs greater computation than IDA, as the sample must be updated at every time step. Further, the values must be maintained in two orders, value order to support discretization and time order to allow maintenance of the window.
Nonetheless we show that this additional computational burden can deliver substantial benefit in the context of incremental concept drift. It remains an open topic for future research whether it is possible to identify when drifting discretization such as that provided by IDAW is appropriate and when non-drifting discretization such as provided by IDA will be more effective. The computational burden of IDAW could be greatly reduced in contexts where the rate of expected drift relative to the rate at which objects arrive is low, by only updating with occasional randomly selected objects.

We conducted our experiments using LRSGD. We have shown that with this learner discretization can deliver substantial reductions in error relative to learning from undiscretized data. This is not to claim that more sophisticated treatment of undiscretized data could not achieve even better results. Our objective is to show that discretization is a practical addition to the streaming data toolbox which is worthy of consideration, rather than to argue that it provides universal benefit.

While our research has only considered classification learning from stream data, discretization is likely to also prove valuable for other data mining activities on data streams including itemset mining [15] and clustering [16]. We leave it to future research to explore the potential benefits of discretization in these contexts and the relative merits of alternative stream discretization strategies.

It is a surprising gap in the data mining literature that relatively little has been done on discretization for streaming data. Perhaps the greatest contribution of this paper is to have shown that it can be done in a computationally efficient manner and that it can deliver substantial value.

The executable binaries, scripts, datasets and instructions required to replicate the experiments can be downloaded from webb/software/incremental-discretization.tgz.

REFERENCES

[1] G. I. Webb, "Naive Bayes," in Encyclopedia of Machine Learning, C. Sammut and G. I. Webb, Eds. Springer, 2011.
[2] Y. Yang and G. I. Webb, "Discretization for naive-Bayes learning: Managing discretization bias and variance," Machine Learning, vol. 74, no. 1.
[3] S. Garcia, J. Luengo, J. Saez, V. Lopez, and F. Herrera, "A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, April.
[4] A. Stuart and J. K. Ord, Kendall's Advanced Theory of Statistics, 6th ed. Edward Arnold.
[5] J. S. Vitter, "Random sampling with a reservoir," ACM Trans. Mathematical Software, vol. 11, no. 1.
[6] J. van Leeuwen and D. Wood, "Interval heaps," The Computer Journal, vol. 36, no. 3.
[7] S. Guha and A. McGregor, "Stream order and order statistics: Quantile estimation in random-order streams," SIAM Journal on Computing, vol. 38, no. 5.
[8] A. Gupta and F. X. Zane, "Counting inversions in lists," in Proc. Fourteenth Annual ACM-SIAM Symp. Discrete Algorithms, ser. SODA '03, 2003.
[9] T. Elomaa and P. Lehtinen, "Maintaining optimal multi-way splits for numerical attributes in data streams," in PAKDD08, 2008.
[10] J. Gama and C. Pinto, "Discretization from data streams: applications to histograms and data mining," in Proc. ACM Symp. Applied Computing. ACM, 2006.
[11] J. Lu, Y. Yang, and G. I. Webb, "Incremental discretization for naive-Bayes classifier," in Proc. 2nd Int. Conf. Advanced Data Mining and Applications (ADMA 2006). Springer, 2006.
[12] MOA. [Online].
[13] K. Bache and M. Lichman, "UCI machine learning repository." [Online].
[14] X. Xu, "Stream data mining repository." [Online]. Available: xqzhu/stream.html
[15] N. Jiang and L. Gruenwald, "Cfi-stream: mining closed frequent itemsets in data streams," in ACM SIGKDD-06. ACM, 2006.
[16] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29, 2003.


More information

Help for Time-Resolved Analysis TRI2 version 2.4 P Barber,

Help for Time-Resolved Analysis TRI2 version 2.4 P Barber, Help for Tme-Resolved Analyss TRI2 verson 2.4 P Barber, 22.01.10 Introducton Tme-resolved Analyss (TRA) becomes avalable under the processng menu once you have loaded and selected an mage that contans

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information