arxiv: v1 [cs.db] 15 Jan 2016


ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg
UC Berkeley, Columbia University
{sanjaykrishnan, jnwang, franklin,

ABSTRACT

Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training process because they violate independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of being re-trained, and which can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

1. INTRODUCTION

Machine Learning on large and growing datasets is a key data management challenge with significant interest in both industry and academia [1, 5, 1, 2]. Despite a number of breakthroughs in reducing training time, predictive modeling can still be a tedious and time-consuming task for an analyst. Data often arrive dirty, including missing, incorrect, or inconsistent attributes, and analysts widely report that data cleaning and other forms of pre-processing account for up to 8% of their effort [3, 22].
While data cleaning is an extensively studied problem, the predictive modeling setting poses a number of new challenges: (1) high dimensionality can amplify even a small number of erroneous records [36], (2) the complexity can make it difficult to trace the consequences of an error, and (3) there are often subtle technical conditions (e.g., independent and identically distributed) that can be violated by data cleaning. Consequently, techniques that have been designed for traditional SQL analytics may be inefficient or even unreliable. In this paper, we study the relationship between data cleaning and model training workflows and explore how to apply existing data cleaning approaches with provable guarantees.

One of the main bottlenecks in data cleaning is the human effort in determining which data are dirty and then developing rules or software to correct the problems. For some types of dirty data, such as inconsistent values, model training may seemingly succeed, albeit with potential subtle inaccuracies in the model. For example, battery-powered sensors can transmit unreliable measurements when battery levels are low [21]. Similarly, data entered by humans can be susceptible to a variety of inconsistencies (e.g., typos) and unintentional cognitive biases [23]. Such problems are often addressed in a time-consuming loop where the analyst trains a model, inspects the model and its predictions, cleans some data, and re-trains.

This iterative process is the de facto standard, but without appropriate care it can lead to several serious statistical issues. Due to the well-known Simpson's paradox, models trained on a mix of dirty and clean data can have very misleading results even in simple scenarios (Figure 1). Furthermore, if the candidate dirty records are not identified with a known sampling distribution, the statistical independence assumptions for most training methods are violated. The violations of these assumptions can introduce confounding biases. To this end, we designed ActiveClean, which trains predictive models while allowing for iterative data cleaning and has accuracy guarantees.
ActiveClean automates the dirty data identification process and the model update process, thereby abstracting these two error-prone steps away from the analyst.

ActiveClean is inspired by the recent success of progressive data cleaning, where a user can gradually clean more data until the desired accuracy is reached [6, 34, 3, 17, 26, 38, 37]. We focus on a popular class of models called convex loss models (which includes linear regression and SVMs) and show that the Simpson's paradox problem can be avoided using iterative maintenance of a model rather than re-training. This process leverages the convex structure of the model rather than treating it like a black box, and we apply convergence arguments from convex optimization theory. We propose several novel optimizations that leverage information from the model to guide data cleaning towards the records most likely to be dirty and most likely to affect the results.

To summarize the contributions:

Correctness (Section 5). We show how to update a dirty model given newly cleaned data. This update converges monotonically in expectation. For a batch size $b$ and iterations $T$, it converges with rate $O(\frac{1}{\sqrt{bT}})$.

Efficiency (Section 6). We derive a theoretical optimal sampling distribution that minimizes the update error and an approximation to estimate the theoretical optimum.

Detection and Estimation (Section 7). We show how

ActiveClean can be integrated with data detection to guide data cleaning towards records expected to be dirty.

The experiments evaluate these components on four datasets with real and synthetic corruption (Section 8). Results suggest that for a fixed cleaning budget, ActiveClean returns more accurate models than uniform sampling and Active Learning when systematic corruption is sparse.

2. BACKGROUND AND PROBLEM SETUP

This section formalizes the iterative data cleaning and training process and highlights an example application.

2.1 Predictive Modeling

The user provides a relation $R$ and wishes to train a model using the data in $R$. This work focuses on a class of well-analyzed predictive analytics problems: ones that can be expressed as the minimization of convex loss functions. Convex loss minimization problems are amenable to a variety of incremental optimization methodologies with provable guarantees (see Friedman, Hastie, and Tibshirani [15] for an introduction). Examples include generalized linear models (including linear and logistic regression) and support vector machines; in fact, means and medians are also special cases.

We assume that the user provides a featurizer $F(\cdot)$ that maps every record $r \in R$ to a feature vector $x$ and label $y$. For labeled training examples $\{(x_i, y_i)\}_{i=1}^{N}$, the problem is to find a vector of model parameters $\theta$ by minimizing a loss function $\phi$ over all training examples:

\[ \theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \phi(x_i, y_i, \theta) \]

where $\phi$ is a convex function in $\theta$. For example, in a linear regression $\phi$ is:

\[ \phi(x_i, y_i, \theta) = \| \theta^{T} x_i - y_i \|_2^2 \]

Typically, a regularization term $r(\theta)$ is added to this problem. $r(\theta)$ penalizes high or low values of feature weights in $\theta$ to avoid overfitting to noise in the training examples:

\[ \theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \phi(x_i, y_i, \theta) + r(\theta) \qquad (1) \]

In this work, without loss of generality, we will include the regularization as part of the loss function, i.e., $\phi(x_i, y_i, \theta)$ includes $r(\theta)$.

2.2 Data Cleaning

We consider corruption that affects the attribute values of records. This does not cover errors that simultaneously affect multiple records, such as record duplication, or errors in structure, such as schema transformation.
Examples of supported cleaning operations include batch resolving common inconsistencies (e.g., merging "U.S.A" and "United States"), filtering outliers (e.g., removing records with values > 1e6), and standardizing attribute semantics (e.g., "1.2 miles" and "1.93 km"). We are particularly interested in those errors that are difficult or time-consuming to clean, and that require the analyst to examine an erroneous record and determine the appropriate action, possibly leveraging knowledge of the current best model. We represent this operation as Clean(·), which can be applied to a record $r$ (or a set of records) to recover the clean record $r' = Clean(r)$. Formally, we treat Clean(·) as an expensive user-defined function composed of deterministic schema-preserving map and filter operations applied to a subset of rows in the relation. A relation is defined as clean if $R_{clean} = Clean(R_{clean})$. Therefore, for every $r' \in R_{clean}$ there exists a unique $r \in R$ in the dirty data. The map and filter cleaning model is not a fundamental restriction of ActiveClean, and Appendix A discusses a compatible "set of records" cleaning model.

2.3 Iteration

As an example of how Clean(·) fits into an iterative analysis process, consider an analyst training a regression and identifying outliers. When she examines one of the outliers, she realizes that the base data (prior to featurization) has a formatting inconsistency that leads to incorrect parsing of the numerical values. She applies a batch fix (i.e., Clean(·)) to all of the outliers with the same error, and re-trains the model. This iterative process can be described as the following pseudocode loop:

1. Init(iter)
2. current_model = Train(R)
3. For each t in {1, ..., iter}
   (a) dirty_sample = Identify(R, current_model)
   (b) clean_sample = Clean(dirty_sample)
   (c) current_model = Update(clean_sample, R)
4. Output: current_model

While we have already discussed Train(·) and Clean(·), the analyst still has to define the primitives Identify(·) and Update(·).
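The pseudocode loop above can be sketched in Python as a function that takes the four primitives as arguments. This is only an illustration under stated assumptions: the names `iterative_cleaning`, `train`, `identify`, `clean`, and `update` are hypothetical, records are plain dictionaries, and `update` additionally receives the current model (the pseudocode does not show this argument). The toy `update` below simply re-trains, which is exactly the naive choice the following sections warn about; it is safe here only because every dirty record is cleaned.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, float]
Model = List[float]


def iterative_cleaning(
    records: Sequence[Record],
    train: Callable[[Sequence[Record]], Model],
    identify: Callable[[Sequence[Record], Model], List[int]],
    clean: Callable[[Record], Record],
    update: Callable[[Model, List[Record], Sequence[Record]], Model],
    iterations: int,
) -> Model:
    """Train once, then alternate Identify -> Clean -> Update."""
    data = list(records)              # working copy of the relation R
    model = train(data)               # step 2: initial (dirty) model
    for _ in range(iterations):       # step 3
        dirty_idx = identify(data, model)
        cleaned = [clean(data[i]) for i in dirty_idx]
        for i, rec in zip(dirty_idx, cleaned):
            data[i] = rec             # repair in place
        model = update(model, cleaned, data)
    return model                      # step 4


# Toy primitives (all hypothetical): the "model" is the mean of field "v",
# and one record was parsed with a factor-of-1000 formatting error.
def mean_model(rows: Sequence[Record]) -> Model:
    return [sum(r["v"] for r in rows) / len(rows)]


result = iterative_cleaning(
    records=[{"v": 1.0}, {"v": 2.0}, {"v": 3000.0}],
    train=mean_model,
    identify=lambda rows, m: [i for i, r in enumerate(rows) if r["v"] > 100.0],
    clean=lambda r: {"v": r["v"] / 1000.0},
    update=lambda m, cleaned, rows: mean_model(rows),  # naive re-training
    iterations=1,
)
```

Passing the primitives as callables keeps the loop itself independent of how Identify(·) and Update(·) are implemented, which is the part the rest of the paper is concerned with.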
For Identify(·), given the current best model, the analyst must specify some criteria to select a set of records to examine. In Update(·), the analyst must decide how to update the model given newly cleaned data. It turns out that these primitives are not trivial to implement, since the straightforward solutions can actually lead to divergence of the trained models.

2.4 Challenges

Correctness: Let us assume that the analyst has implemented an Identify(·) function that returns $k$ candidate dirty records. The straightforward application of data cleaning is to repair the corruption in place, and re-train the model after each repair. Suppose $k \ll N$ records are cleaned, but all of the remaining dirty records are retained in the dataset. Figure 1 highlights the dangers of this approach on a very simple dirty dataset and a linear regression model, i.e., the best fit line for two variables. One of the variables is systematically corrupted with a translation in the x-axis (Figure 1a). The dirty data is marked in red and the clean data in blue, and they are shown with their respective best fit lines. After cleaning only two of the data points (Figure 1b), the resulting best fit line is in the opposite direction of the true model.

Aggregates over mixtures of different populations of data can result in spurious relationships due to the well-known phenomenon called Simpson's paradox [32]. Simpson's paradox is by no means a corner case, and it has affected the validity of a number of high-profile studies [35], even in the simple case of taking an average over a dataset. Predictive models are high-dimensional generalizations of these aggregates without closed form techniques to compensate for these

biases. Thus, training models on a mixture of dirty and clean data can lead to unreliable results, where artificial trends introduced by the mixture can be confused for the effects of data cleaning.

Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly inaccurate models.

An alternative is to avoid the dirty data altogether instead of mixing the two populations, so that model re-training is restricted to only data that are known to be clean. This approach is similar to SampleClean [33], which was proposed to approximate the results of aggregate queries by applying them to a clean sample of data. However, high-dimensional models are highly sensitive to sample size. Figure 1c illustrates that, even in two dimensions, models trained from small samples can be as incorrect as the mixing solution described before.

Efficiency: Conversely, hypothetically assume that the analyst has implemented a correct Update(·) primitive and implements Identify(·) with a technique such as Active Learning to select records to clean [37, 38, 16]. Active learning is a technique to carefully select the set of examples to learn the most accurate model. However, these selection criteria are designed for stationary data distributions, an assumption which is not true in this setting. As more data are cleaned, the data distribution changes. Data which may look unimportant in the dirty data might be very valuable to clean in reality, and thus any prioritization has to predict a record's value with respect to an anticipated clean model.

2.5 The Need For Automation

ActiveClean is a framework that implements the Identify(·) and Update(·) primitives for the analyst. By automating the iterative process, ActiveClean ensures reliable models with convergence guarantees. The analyst first initializes ActiveClean with a dirty model. ActiveClean carefully selects small batches of data to clean based on data that are likely to be dirty and likely to affect the model.
The analyst applies data cleaning to these batches, and ActiveClean updates the model with an incremental optimization technique.

Machine learning has been applied in prior work to improve the efficiency of data cleaning [37, 38, 16]. Human input, either for cleaning or validation of automated cleaning, is often expensive and impractical for large datasets. A model can learn rules from a small set of examples cleaned (or validated) by a human, and active learning is a technique to carefully select the set of examples to learn the most accurate model. This model can be used to extrapolate repairs to not-yet-cleaned data, and the goal of these approaches is to provide the cleanest possible dataset independent of the subsequent analytics or query processing. These approaches, while very effective, suffer from composability problems when placed inside cleaning and training loops. To summarize, ActiveClean considers data cleaning during model training, while these techniques consider model training for data cleaning. One of the primary contributions of this work is an incremental model update algorithm with correctness guarantees for mixtures of data.

2.6 Use Case: Dollars for Docs [2]

ProPublica collected a dataset of corporate donations to doctors to analyze conflicts of interest. They reported that some doctors received over $5, in travel, meals, and consultation expenses [4]. ProPublica laboriously curated and cleaned a dataset from the Centers for Medicare and Medicaid Services that listed nearly 25, research donations, and aggregated these donations by physician, drug, and pharmaceutical company. We collected the raw unaggregated data and explored whether suspect donations could be predicted with a model. This problem is typical of analysis scenarios based on observational data seen in finance, insurance, medicine, and investigative journalism.
The dataset has the following schema:

Contribution(p_specialty, drug_name, device_name, corporation, amount, dispute, status)

p_specialty is a textual attribute describing the specialty of the doctor receiving the donation.
drug_name is the branded name of the drug in the research study (null if not a drug).
device_name is the branded name of the device in the study (null if not a device).
corporation is the name of the pharmaceutical company providing the donation.
amount is a numerical attribute representing the donation amount.
dispute is a Boolean attribute describing whether the research was disputed.
status is a string label describing whether the donation was allowed under the declared research protocol. The goal is to predict disallowed donations.

However, this dataset is very dirty, and the systematic nature of the data corruption can result in an inaccurate model. On the ProPublica website [2], they list numerous types of data problems that had to be cleaned before publishing the data (see Appendix I). For example, the most significant donations were made by large companies whose names were also more often inconsistently represented in the data, e.g., "Pfizer Inc.", "Pfizer Incorporated", "Pfizer". In such scenarios, the effect of systematic error can be serious. Duplicate representations could artificially reduce the correlation between these entities and suspected contributions. There were nearly 4, of the 25, records that had either naming inconsistencies or other inconsistencies in labeling the allowed or disallowed status. Without data cleaning, the detection rate using a Support Vector Machine was 66%. Applying the data cleaning to the entire dataset improved this rate to 97% in the clean data (Section 8.6.1), and the experiments describe how ActiveClean can achieve an 8% detection rate for less than 1.6% of the records cleaned.

3. PROBLEM FORMALIZATION

This section formalizes the problems addressed in the paper.

3.1 Notation and Setup

The user provides a relation $R$, a cleaner $C(\cdot)$, a featurizer $F(\cdot)$, and a convex loss problem defined by the loss $\phi(\cdot)$. A total of $k$ records will be cleaned in batches of size $b$, so there will be $\frac{k}{b}$ iterations. We use the following notation to represent relevant intermediate states:

Dirty Model: $\theta^{(d)}$ is the model trained on $R$ (without cleaning) with the featurizer $F(\cdot)$ and loss $\phi(\cdot)$. This serves as an initialization to ActiveClean.
Dirty Records: $R_{dirty} \subseteq R$ is the subset of records that are still dirty. As more data are cleaned, $R_{dirty} \to \{\}$.
Clean Records: $R_{clean} \subseteq R$ is the subset of records that are clean, i.e., the complement of $R_{dirty}$.
Samples: $S$ is a sample (possibly non-uniform but with known probabilities) of the records $R_{dirty}$. The clean sample is denoted by $S_{clean} = C(S)$.
Clean Model: $\theta^{(c)}$ is the optimal clean model, i.e., the model trained on a fully cleaned relation.
Current Model: $\theta^{(t)}$ is the current best model at iteration $t \in \{1, ..., \frac{k}{b}\}$, and $\theta^{(0)} = \theta^{(d)}$.

There are two metrics that we will use to measure the performance of ActiveClean:

Model Error. The model error is defined as $\| \theta^{(t)} - \theta^{(c)} \|$.
Testing Error. Let $T(\theta^{(t)})$ be the out-of-sample testing error when the current best model is applied to the clean data, and $T(\theta^{(c)})$ be the test error when the clean model is applied to the clean data. The testing error is defined as $T(\theta^{(t)}) - T(\theta^{(c)})$.

3.2 Problem 1. Correct Update Problem

Given newly cleaned data $S_{clean}$ and the current best model $\theta^{(t)}$, the model update problem is to calculate $\theta^{(t+1)}$. $\theta^{(t+1)}$ will have some error with respect to the true model $\theta^{(c)}$, which we denote as:

\[ error(\theta^{(t+1)}) = \| \theta^{(t+1)} - \theta^{(c)} \| \]

Since a sample of data are cleaned, it is only meaningful to talk about expected errors. We call the update algorithm "reliable" if the expected error is upper bounded by a monotonically decreasing function $\mu$ of the amount of cleaned data:

\[ E(error(\theta^{new})) = O(\mu(|S_{clean}|)) \]

Intuitively, "reliable" means that more cleaning should imply more accuracy.
The Correct Update Problem is to reliably update the model $\theta^{(t)}$ with a sample of cleaned data.

3.3 Problem 2. Efficiency Problem

The efficiency problem is to select $S_{clean}$ such that the expected error $E(error(\theta^{(t)}))$ is minimized. ActiveClean uses previously cleaned data to estimate the value of data cleaning on new records. Then it draws a sample of records $S \subseteq R_{dirty}$. This is a non-uniform sample where each record $r$ has a sampling probability $p(r)$ based on the estimates. We derive the optimal sampling distribution for the SGD updates, and show how the theoretical optimum can be approximated.

The Efficiency Problem is to select a sampling distribution $p(\cdot)$ over all records such that the expected error with respect to the model trained on fully clean data is minimized.

4. ARCHITECTURE

This section presents the ActiveClean architecture.

4.1 Overview

Figure 2 illustrates the ActiveClean architecture. The dotted boxes describe optional components that the user can provide to improve the efficiency of the system.

Required User Input

Model: The user provides a predictive model (e.g., SVM) specified as a convex loss optimization problem $\phi(\cdot)$ and a featurizer $F(\cdot)$ that maps a record to its feature vector $x$ and label $y$.

Cleaning Function: The user provides a function $C(\cdot)$ (implemented via software or crowdsourcing) that maps dirty records to clean records as per our definition in Section 2.2.

Batches: Data are cleaned in batches of size $b$, and the user can change this setting if she desires more or less frequent model updates. The choice of $b$ does affect the convergence rate. Section 5 discusses the efficiency and convergence trade-offs of different values of $b$. We empirically find that a batch size of 5 performs well across different datasets and use that as a default. A cleaning budget $k$ can be used as a stopping criterion once $C(\cdot)$ has been called $k$ times, and so the number of iterations of ActiveClean is $T = \frac{k}{b}$. Alternatively, the user can clean data until the model is of sufficient accuracy to make a decision.

Basic Data Flow

The system first trains the model $\phi(\cdot)$ on the dirty dataset to find an initial model $\theta^{(d)}$ that the system will subsequently improve. The sampler selects a sample of $b$ records from the dataset and passes the sample to the cleaner, which executes $C(\cdot)$ for each sample record and outputs their cleaned versions. The updater uses the cleaned sample to update the weights of the model, thus moving the model closer to the true cleaned model (in expectation). Finally, the system either terminates due to a stopping condition (e.g., $C(\cdot)$ has been called a maximum number of times $k$, or training error convergence), or passes control to the sampler for the next iteration.

Optimizations

In many cases, such as missing values, errors can be efficiently detected. A user-provided Detector can be used to identify records that are more likely to be dirty, and thus improves the likelihood that the next sample will contain true dirty records. Furthermore, the Estimator uses previously cleaned data to estimate the effect that cleaning a given record will have on the model. These components can be used separately (if only one is supplied) or together to focus the system's cleaning efforts on records that will most improve the model. Section 7 describes several instantiations of these components for different data cleaning problems. Our experiments show that these optimizations can improve model accuracy by up to 2.5x (Section 8.3.2).

4.2 Example

The following example illustrates how a user would apply ActiveClean to address the use case in Section 2.6:

EXAMPLE 1. The analyst chooses to use an SVM model, and manually cleans records by hand (the $C(\cdot)$). ActiveClean initially selects a sample of 5 records (the default) to show the analyst. She identifies a subset of 15 records that are dirty, fixes them by normalizing the drug and corporation names with the

help of a search engine, and corrects the labels with typographical or incorrect values. The system then uses the cleaned records to update the current best model and select the next sample of 5. The analyst can stop at any time and use the improved model to predict donation likelihoods.

Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data. The framework adaptively selects the best data to clean and can optionally (denoted with dotted lines) integrate with predefined detection rules and estimation algorithms for improved convergence.

5. UPDATES WITH CORRECTNESS

This section describes an algorithm for reliable model updates. The updater assumes that it is given a sample of data $S_{dirty}$ from $R_{dirty}$ where each $i \in S_{dirty}$ has a known sampling probability $p(i)$. Sections 6 and 7 show how to optimize $p(\cdot)$, and the analysis in this section applies for any sampling distribution $p(\cdot) > 0$.

5.1 Geometric Derivation

The update algorithm intuitively follows from the convex geometry of the problem. Consider the problem in one dimension (i.e., the parameter $\theta$ is a scalar value), so that the goal is to find the minimum point $\theta$ of a curve $l(\theta)$. The consequence of dirty data is that the wrong loss function is optimized. Figure 3A illustrates the consequence of the optimization. The red dotted line shows the loss function on the dirty data. Optimizing the loss function finds $\theta^{(d)}$ at the minimum point (red star). However, the true loss function (with respect to the clean data) is in blue, thus the optimal value on the dirty data is in fact a suboptimal point on the clean curve (red circle).

Figure 3: (A) A model trained on dirty data can be thought of as a sub-optimal point with respect to the clean data. (B) The gradient gives us the direction to move the suboptimal model to approach the true optimum.

The optimal clean model $\theta^{(c)}$ is visualized as a yellow star. The first question is which direction to update $\theta^{(d)}$ (i.e., left or right). For this class of models, given a suboptimal point, the direction to the global optimum is given by the gradient of the loss function. The gradient is a $d$-dimensional vector function of the current model $\theta^{(d)}$ and the clean data. Therefore, ActiveClean needs to update $\theta^{(d)}$ some distance $\gamma$ (Figure 3B):

\[ \theta^{new} \leftarrow \theta^{(d)} - \gamma \cdot \nabla\phi(\theta^{(d)}) \]

At the optimal point, the magnitude of the gradient will be zero. So intuitively, this approach iteratively moves the model downhill (transparent red circle), correcting the dirty model until the desired accuracy is reached. However, the gradient depends on all of the clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. The main intuition is that if the gradient steps are on average correct, the model still moves downhill, albeit with a reduced convergence rate proportional to the inaccuracy of the sample-based estimate.

To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current best model $\theta$, the true gradient $g^{*}(\theta)$ is:

\[ g^{*}(\theta) = \nabla\phi(\theta) = \frac{1}{N} \sum_{i}^{N} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta) \]

ActiveClean needs to estimate $g^{*}(\theta)$ from a sample $S$, which is drawn from the dirty data $R_{dirty}$. Therefore, the sum has two components: the gradient on the already clean data $g_C$, which can be computed without cleaning, and $g_S$, the gradient estimate from a sample of dirty data to be cleaned:

\[ g(\theta) = \frac{|R_{clean}|}{|R|} \cdot g_C(\theta) + \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta) \]

$g_C$ can be calculated by applying the gradient to all of the already cleaned records:

\[ g_C(\theta) = \frac{1}{|R_{clean}|} \sum_{i \in R_{clean}} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta) \]

$g_S$ can be estimated from a sample by taking the gradient with respect to each record, and re-weighting the average by their respective sampling probabilities. Before taking the gradient, the cleaning function $C(\cdot)$ is applied to each sampled record. Therefore, let $S$ be a sample of data, where each $i \in S$ is drawn with probability $p(i)$:

\[ g_S(\theta) = \frac{1}{|S|} \sum_{i \in S} \frac{1}{p(i)} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta) \]

Then, at each iteration $t$, the update becomes:

\[ \theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot g(\theta^{(t)}) \]

5.2 Model Update Algorithm

To summarize, the algorithm is initialized with $\theta^{(0)} = \theta^{(d)}$, which is the dirty model.
There are three user-set parameters: the budget $k$, the batch size $b$, and the step size $\gamma$. In the following section, we will provide references from the convex optimization literature that allow the user to appropriately select these values. At each iteration $t \in \{1, ..., T\}$, the cleaning is applied to a batch of $b$ records selected from the set of candidate dirty records $R_{dirty}$. Then, an average gradient is estimated from the cleaned batch and the model is updated. Iterations continue until $k = T \cdot b$ records are cleaned.

1. Calculate the gradient over the sample of newly clean data and call the result $g_S(\theta^{(t)})$.
2. Calculate the average gradient over all of the already clean data in $R_{clean} = R - R_{dirty}$, and call the result $g_C(\theta^{(t)})$.
3. Apply the following update rule:

\[ \theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot \left( \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta^{(t)}) + \frac{|R_{clean}|}{|R|} \cdot g_C(\theta^{(t)}) \right) \]

5.3 Analysis with Stochastic Gradient Descent

The update algorithm can be formalized as a class of very well studied algorithms called Stochastic Gradient Descent (SGD). SGD provides a theoretical framework to understand and analyze the update rule and bound the error. Mini-batch SGD is an algorithm for finding the optimal value given the convex loss and data. In mini-batch SGD, random subsets of data are selected at each iteration and the average gradient is computed for every batch. One key difference from traditional SGD models is that ActiveClean applies a full gradient step on the already clean data and averages it with a stochastic gradient step (i.e., calculated from a sample) on the dirty data. Therefore, ActiveClean iterations can take multiple passes over the clean data but at most a single cleaning pass of the dirty data. The update algorithm can be thought of as a variant of SGD that lazily materializes the clean value. As data is sampled at each iteration, data is cleaned when needed by the optimization. It is well known that even for an arbitrary initialization, SGD makes significant progress in less than one epoch (a pass through the entire dataset) [9]. In practice, the dirty model can be much more accurate than an arbitrary initialization, as corruption may only affect a few features, and combined with the full gradient step on the clean data, the updates converge very quickly.

Setting the step size $\gamma$: There is extensive literature in machine learning on choosing the step size $\gamma$ appropriately. $\gamma$ can be set either to be a constant or decayed over time. Many machine learning frameworks (e.g., MLlib, Scikit-learn, Vowpal Wabbit) automatically set learning rates or provide different learning scheduling frameworks.
In the experiments, we use a technique called inverse scaling, where there is a parameter $\gamma = 0.1$, and at each iteration $t$ it decays to $\gamma_t = \frac{\gamma}{|S| t}$.

Setting the batch size $b$: The batch size should be set by the user to have the desired properties. Larger batches will take longer to clean and will make more progress towards the clean model but will have less frequent model updates. On the other hand, smaller batches are cleaned faster and have more frequent model updates. There are diminishing returns to increasing the batch size, $O(\frac{1}{\sqrt{b}})$. In the experiments, we use a batch size of 5, which converges fast but allows for frequent model updates. If a data cleaning technique requires a larger batch size than 5, i.e., data cleaning is fast enough that the iteration overhead is significant compared to cleaning 5 records, ActiveClean can apply the updates in smaller batches. For example, the batch size set by the user might be b = 1, but the model updates after every 5 records are cleaned. We can dissociate the batching requirements of SGD from the batching requirements of the data cleaning technique.

Convergence Conditions and Properties

Convergence properties of batch SGD formulations have been well studied [11]. Essentially, if the gradient estimate is unbiased and the step size is appropriately chosen, the algorithm is guaranteed to converge. In Appendix B, we show that the gradient estimate from ActiveClean is indeed unbiased and our choice of step size is one that is established to converge. The convergence rates of SGD are also well analyzed [11, 8, 39]. The analysis gives a bound on the error of intermediate models and the expected number of steps before achieving a model within a certain error. For a general convex loss, a batch size $b$, and $T$ iterations, the convergence rate is bounded by $O(\frac{\sigma^2}{\sqrt{bT}})$. $\sigma^2$ is the variance in the estimate of the gradient at each iteration:

\[ E(\| g - g^{*} \|^2) \]

where $g^{*}$ is the gradient computed over the full data if it were fully cleaned.
This property of SGD allows us to bound the model error with a monotonically decreasing function of the number of records cleaned, thus satisfying the reliability condition in the problem statement. If the loss is non-convex, the update procedure will converge towards a local minimum rather than the global minimum (see Appendix C).

5.4 Example

This example describes an application of the update algorithm.

EXAMPLE 2. Recall that the analyst has a dirty SVM model $\theta^{(d)}$ trained on the dirty data. She decides that she has a budget of cleaning 1 records, and decides to clean the 1 records in batches of 1 (set based on how fast she can clean the data, and how often she wants to see an updated result). All of the data is initially treated as dirty, with $R_{dirty} = R$ and $R_{clean} = \emptyset$. The gradient of a basic SVM is given by the following function:

\[ \nabla\phi(x, y, \theta) = \begin{cases} -y \cdot x & \text{if } y \, x \cdot \theta \leq 1 \\ 0 & \text{if } y \, x \cdot \theta \geq 1 \end{cases} \]

For each iteration $t$, a sample of 1 records $S$ is drawn from $R_{dirty}$. ActiveClean then applies the cleaning function to the sample:

\[ \{(x_i^{(c)}, y_i^{(c)})\} = \{C(i) : i \in S\} \]

Using these values, ActiveClean estimates the gradient on the newly cleaned data:

\[ g_S(\theta) = \frac{1}{|S|} \sum_{i \in S} \frac{1}{p(i)} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta) \]

ActiveClean also applies the gradient to the already clean data (initially non-existent):

\[ g_C(\theta) = \frac{1}{|R_{clean}|} \sum_{i \in R_{clean}} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta) \]

Then, it calculates the update rule:

\[ \theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot \left( \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta^{(t)}) + \frac{|R_{clean}|}{|R|} \cdot g_C(\theta^{(t)}) \right) \]

Finally, $R_{dirty} \leftarrow R_{dirty} - S$, $R_{clean} \leftarrow R_{clean} + S$, and the algorithm continues to the next iteration.
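One iteration of Example 2 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the names `hinge_gradient` and `activeclean_step` are hypothetical, the step size is passed in as a constant, and the sample is assumed to be drawn from a distribution `p` that sums to one over $R_{dirty}$, so each sampled gradient is weighted by $1 / (|R_{dirty}| \, p(i))$ to make $g_S$ estimate an average gradient over the dirty records.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features x, label y in {-1.0, +1.0})


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def hinge_gradient(x: Vector, y: float, theta: Vector) -> Vector:
    """Subgradient of the basic SVM (hinge) loss for one example."""
    if y * dot(x, theta) <= 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)


def activeclean_step(
    theta: Vector,
    clean_examples: Sequence[Example],   # R_clean, already cleaned
    cleaned_sample: Sequence[Example],   # C(S), the newly cleaned batch
    sample_probs: Sequence[float],       # p(i) for each sampled record
    n_dirty: int,                        # |R_dirty| before removing S
    gamma: float,                        # step size
) -> Vector:
    d = len(theta)
    n_clean = len(clean_examples)
    n_total = n_clean + n_dirty          # |R|

    # g_C: exact average gradient over records that are already clean.
    g_c = [0.0] * d
    for x, y in clean_examples:
        g = hinge_gradient(x, y, theta)
        g_c = [a + b / n_clean for a, b in zip(g_c, g)]

    # g_S: importance-weighted average gradient over the cleaned sample
    # (assumption: p sums to one over R_dirty, hence the 1/n_dirty factor).
    g_s = [0.0] * d
    for (x, y), p in zip(cleaned_sample, sample_probs):
        w = 1.0 / (n_dirty * p * len(cleaned_sample))
        g = hinge_gradient(x, y, theta)
        g_s = [a + w * b for a, b in zip(g_s, g)]

    # Weighted combination of the two gradients, then one descent step.
    return [
        t - gamma * ((n_dirty / n_total) * s + (n_clean / n_total) * c)
        for t, s, c in zip(theta, g_s, g_c)
    ]


# First iteration: nothing is clean yet, two dirty records sampled uniformly.
theta_1 = activeclean_step(
    theta=[0.0, 0.0],
    clean_examples=[],
    cleaned_sample=[([1.0, 0.0], 1.0), ([0.0, 2.0], -1.0)],
    sample_probs=[0.5, 0.5],
    n_dirty=2,
    gamma=0.5,
)
```

After the step, the caller would move the sampled records from $R_{dirty}$ to $R_{clean}$, so later iterations take a full gradient over them through `clean_examples`.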

6. EFFICIENCY WITH SAMPLING

The updater receives a sample with probabilities $p(\cdot)$. For any distribution where $p(\cdot) > 0$, we can preserve correctness. ActiveClean uses a sampling algorithm that selects the most valuable records to clean with higher probability.

6.1 Oracle Sampling Problem

Recall that the convergence rate of an SGD algorithm is bounded by $\sigma^2$, which is the variance of the gradient. Intuitively, the variance measures how accurately the gradient is estimated from a uniform sample. Other sampling distributions, while preserving the sample expected value, may have a lower variance. Thus, the oracle sampling problem is defined as a search over sampling distributions to find the minimum variance sampling distribution.

DEFINITION 1 (ORACLE SAMPLING PROBLEM). Given a set of candidate dirty data $R_{dirty}$, $\forall r \in R_{dirty}$ find sampling probabilities $p(r)$ such that over all samples $S$ of size $k$ it minimizes:

\[ E(\| g_S - g^{*} \|^2) \]

It can be shown that the optimal distribution over records in $R_{dirty}$ has probabilities proportional to:

\[ p_i \propto \| \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta^{(t)}) \| \]

This is an established result; for thoroughness, we provide a proof in the appendix (Section D). Intuitively, records with higher gradients should be sampled with higher probability as they affect the update more significantly. However, ActiveClean cannot exclude records with lower gradients, as that would induce a bias hurting convergence. The problem is that the optimal distribution leads to a chicken-and-egg problem: the optimal sampling distribution requires knowing $(x_i^{(c)}, y_i^{(c)})$; however, cleaning is required to know those values.

6.2 Dirty Gradient Solution

Such an oracle does not exist, and one solution is to use the gradient with respect to the dirty data:

\[ p_i \propto \| \nabla\phi(x_i^{(d)}, y_i^{(d)}, \theta^{(t)}) \| \]

It turns out that this solution works reasonably well in practice on our experimental datasets and has been studied in Machine Learning as the Expected Gradient Length heuristic [31]. The contribution in this work is integrating this heuristic with statistically correct updates. However, intuitively, approximating the oracle as closely as possible can result in improved prioritization.
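The dirty gradient heuristic of Section 6.2 can be sketched as follows. This is a hedged illustration: the names `squared_loss_gradient`, `dirty_gradient_probs`, and `draw_batch` are hypothetical, a squared loss is used only to keep the example short, sampling is done with replacement, and the small `floor` added to every norm is an assumption made here so that every record keeps $p > 0$, as the correctness argument requires.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]


def squared_loss_gradient(x: Vector, y: float, theta: Vector) -> Vector:
    """Gradient of (theta . x - y)^2 with respect to theta."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [2.0 * residual * xi for xi in x]


def dirty_gradient_probs(
    dirty_examples: Sequence[Example], theta: Vector, floor: float = 1e-9
) -> List[float]:
    """p_i proportional to the gradient norm on the (still dirty) record."""
    norms = [
        math.sqrt(sum(g * g for g in squared_loss_gradient(x, y, theta))) + floor
        for x, y in dirty_examples
    ]
    total = sum(norms)
    return [n / total for n in norms]


def draw_batch(probs: Sequence[float], batch_size: int, rng: random.Random) -> List[int]:
    """Sample record indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


dirty = [([1.0], 1.0), ([1.0], 3.0)]
probs = dirty_gradient_probs(dirty, theta=[0.0], floor=0.0)
batch = draw_batch(probs, batch_size=5, rng=random.Random(0))
```

The probabilities returned here are the $p(i)$ values that the update step needs in order to re-weight each sampled gradient, so they should be stored alongside the sampled indices.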
The subsequent section describes two components, the detector and estimator, that can be used to improve the convergence rate. Our experiments suggest up to a 2x improvement in convergence when using these optional optimizations (Section 8.3.2).

7. OPTIMIZATIONS

In this section, we describe two approaches to optimization, the Detector and the Estimator, that improve the efficiency of the cleaning process. Both approaches are designed to increase the likelihood that the Sampler will pick dirty records that, once cleaned, most move the model towards the true clean model. The Detector is intended to learn the characteristics that distinguish dirty records from clean records, while the Estimator is designed to estimate the amount that cleaning a given dirty record will move the model towards the true optimal model.

7.1 The Detector

The detector returns two important aspects of a record: (1) whether the record is dirty, and (2) if it is dirty, what is wrong with the record. The sampler can use (1) to select a subset of dirty records to sample at each batch, and the estimator can use (2) to estimate the value of data cleaning based on other records with the same corruption. ActiveClean supports two types of detectors: a priori and adaptive. The former assumes that we know the set of dirty records and how they are dirty a priori to ActiveClean, while the latter adaptively learns characteristics of the dirty data as part of running ActiveClean.

7.1.1 A Priori Detector

For many types of dirtiness, such as missing attribute values and constraint violations, it is possible to efficiently enumerate a set of corrupted records and determine how the records are corrupted.

DEFINITION 2 (A PRIORI DETECTION). Let $r$ be a record in $R$. An a priori detector is a detector that returns a Boolean of whether the record is dirty and a set of columns $e_r$ that are dirty.

$$D(r) = (\{0, 1\}, e_r)$$

From the set of columns that are dirty, find the corresponding features that are dirty $f_r$ and labels that are dirty $l_r$.

Here is an example of this definition using a data cleaning methodology proposed in the literature.
Constraint-based Repair: One model for detecting errors involves declaring constraints on the database.

Detection. Let $\Sigma$ be a set of constraints on the relation $R$. In the detection step, the detector selects a subset of records $R_{dirty} \subseteq R$ that violate at least one constraint. The set $e_r$ is the set of columns for each record which have a constraint violation.

EXAMPLE 3. An example of a constraint on the running example dataset is that the status of a contribution can be only "allowed" or "disallowed". Any other value for status is an error.

7.2 Adaptive Detection

A priori detection is not possible in all cases. The detector also supports adaptive detection, where detection is learned from previously cleaned data. Note that this "learning" is distinct from the "learning" at the end of the pipeline. The challenge in formulating this problem is that the detector needs to describe how the data is dirty (e.g., $e_r$ in the a priori case). The detector achieves this by categorizing the corruption into $u$ classes. These classes are corruption categories that do not necessarily align with features, but every record is classified with at most one category.

When using adaptive detection, the repair step has to clean the data and report to which of the $u$ classes the corrupted record belongs. When an example $(x, y)$ is cleaned, the repair step labels it with one of the clean, 1, 2, ..., $u + 1$ classes (including one for "not dirty"). It is possible that $u$ increases each iteration as more types of dirtiness are discovered. In

many real-world datasets, data errors have locality, where similar records tend to be similarly corrupted. There are usually a small number of error classes even if a large number of records are corrupted.

One approach for adaptive detection is to use a statistical classifier. This approach is particularly suited to a small number of data error classes, each of which contains many erroneous records. This problem can be addressed by any classifier, and we use an all-versus-one SVM in our experiments.

Another approach could be to adaptively learn predicates that define each of the error classes. For example, if records with certain attributes are corrupted, a pattern tableau can be assigned to each class to select a set of possibly corrupted records. This approach is better suited than a statistical approach for a large number of error classes or scarcity of errors. However, it relies on errors being well aligned with certain attribute values.

DEFINITION 3 (ADAPTIVE CASE). Select the set of records for which $\kappa$ gives a positive error classification (i.e., one of the $u$ error classes). After each sample of data is cleaned, the classifier $\kappa$ is retrained. So the result is:

$$D(r) = (\{1, 0\}, \{1, \ldots, u + 1\})$$

Adaptive Detection With OpenRefine:

EXAMPLE 4. OpenRefine is a spreadsheet-based tool that allows users to explore and transform data. However, it is limited to cleaning data that can fit in memory on a single computer. Since the cleaning operations are coupled with data exploration, ActiveClean does not know what is dirty in advance (the analyst may discover new errors as she cleans). Suppose the analyst wants to use OpenRefine to clean the running example dataset with ActiveClean. She takes a sample of data from the entire dataset and uses the tool to discover errors. For example, she finds that some drugs are incorrectly classified as both drugs and devices. She then removes the device attribute for all records that have the drug name in question. As she fixes the records, she tags each one with a category tag indicating which corruption it belongs to.
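To make the detector interface concrete, the a priori case (Definition 2 and Example 3) can be sketched in Python as below. This is only an illustrative sketch: records are modeled as dictionaries, the column name `status` follows the running example, and the helper names (`status_constraint`, `missing_value_constraint`, `a_priori_detector`) and the second constraint are hypothetical.

```python
ALLOWED_STATUS = {"allowed", "disallowed"}


def status_constraint(record):
    """Columns violating the constraint that status is 'allowed' or 'disallowed'."""
    return set() if record.get("status") in ALLOWED_STATUS else {"status"}


def missing_value_constraint(record):
    """Columns whose value is missing (None or empty string)."""
    return {col for col, val in record.items() if val is None or val == ""}


def a_priori_detector(record, constraints):
    """Return (is_dirty, e_r): a Boolean flag and the set of dirty columns."""
    dirty_columns = set()
    for constraint in constraints:
        dirty_columns |= constraint(record)
    return (len(dirty_columns) > 0, dirty_columns)
```

An adaptive detector would expose the same first component but replace the column set with an error-class label predicted by a classifier that is retrained after each cleaned batch.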
7.3 The Estimator

To get around the problem with oracle sampling, the estimator will estimate the cleaned value with previously cleaned data. The estimator will also take advantage of the detector from the previous section. There are a number of different approaches, such as regression, that could be used to estimate the cleaned value given the dirty values. However, there is a problem of scarcity, where errors may affect only a small number of records. As a result, the regression approach would have to learn a multivariate function with only a few examples. Thus, high-dimensional regression is ill-suited for the estimator. Conversely, one could try a very simple estimator that just calculates an average change and adds this change to all of the gradients. This estimator can be highly inaccurate, as it also applies the change to records that are known to be clean.

ActiveClean leverages the detector for an estimator between these two extremes. The estimator calculates average changes feature-by-feature and selectively corrects the gradient when a feature is known to be corrupted based on the detector. It also applies a linearization that leads to improved estimates when the sample size is small. We evaluate the linearization in Section 8.5 against alternatives, and find that it provides more accurate estimates for a small number of samples cleaned. The result is a biased estimator, and when the number of cleaned samples is large the alternative techniques are comparable or even slightly better due to the bias.

Estimation For A Priori Detection. If most of the features are correct, it would seem like the gradient is only incorrect in one or two of its components. The problem is that the gradient $\nabla\phi(\cdot)$ can be a very nonlinear function of the features that couples features together. For example, the gradient for linear regression is:

$$\nabla\phi(x, y, \theta) = (\theta^T x - y)\,x$$

It is not possible to isolate the effect of a change of one feature on the gradient. Even if only one of the features is corrupted, all of the gradient components will be incorrect.
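A small numeric check makes the coupling visible. The sketch below evaluates the linear regression gradient on a clean record and on a copy in which only the third feature is corrupted; the parameter and feature values are made up purely for illustration.

```python
def linreg_gradient(x, y, theta):
    """Gradient of the squared loss for one record: (theta . x - y) * x."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


theta = [1.0, 2.0, -1.0]
y = 1.0
clean_x = [1.0, 1.0, 1.0]
dirty_x = [1.0, 1.0, 5.0]   # only the third feature differs

g_clean = linreg_gradient(clean_x, y, theta)   # residual = 1
g_dirty = linreg_gradient(dirty_x, y, theta)   # residual = -3

# Every component changes, because the residual multiplies all features.
changed = [abs(a - b) > 1e-12 for a, b in zip(g_clean, g_dirty)]
```

Because the corrupted feature enters the residual, which scales every component, the error spreads across the whole gradient vector rather than staying in one coordinate.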
To address this problem, the gradient can be approximated in a way that the effects of dirty features on the gradient are decoupled. Recall that, in the a priori detection problem, associated with each $r \in R_{dirty}$ is a set of errors $f_r, l_r$, which identifies the corrupted features and labels. This property can be used to construct a coarse estimate of the clean value. The main idea is to calculate average changes for each feature; then, given an uncleaned (but dirty) record, add these average changes to correct the gradient.

To formalize the intuition, instead of computing the actual gradient with respect to the true clean values, compute the conditional expectation given that a set of features and labels $f_r, l_r$ are corrupted:

$$p_i \propto \|E(\nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta^{(t)}) \mid f_r, l_r)\|$$

Corrupted features are defined such that:

$$i \notin f_r \implies x^{(c)}[i] - x^{(d)}[i] = 0$$
$$i \notin l_r \implies y^{(c)}[i] - y^{(d)}[i] = 0$$

The needed approximation represents a linearization of the errors, and the resulting approximation will be of the form:

$$p(r) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{rx} + M_y \cdot \Delta_{ry}\|$$

where $M_x$, $M_y$ are matrices and $\Delta_{rx}$ and $\Delta_{ry}$ are vectors with one component for each feature and label, where each value is the average change for those features that are corrupted and 0 otherwise. Essentially, it is the gradient with respect to the dirty data plus some linear correction factor. In the appendix, we present a derivation using a Taylor series expansion and a number of $M_x$ and $M_y$ matrices for common convex losses (Appendix E and F). The appendix also describes how to maintain $\Delta_{rx}$ and $\Delta_{ry}$ as cleaning progresses.

Estimation For Adaptive Case. A similar procedure holds in the adaptive setting; however, it requires reformulation. Here, ActiveClean uses the $u$ corruption classes provided by the detector. Instead of conditioning on the features that are corrupted, the estimator conditions on the classes. So for each error class, it computes a $\Delta_{ux}$ and a $\Delta_{uy}$. These are the average change in the features given that class and the average change in labels given that class.
$$p(r_u) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{ux} + M_y \cdot \Delta_{uy}\|$$

7.3.1 Example

Here is an example of using the optimization to select a sample of data for cleaning.
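Before the worked example, the following Python sketch shows one way the per-feature average changes might be maintained and turned into an unnormalized sampling weight for an SVM with feature errors only. It is a sketch under assumptions rather than the paper's implementation: feature indices are 0-based, the class name `FeatureChangeEstimator` and the function names are hypothetical, and the correction uses the SVM form in which $M_x$ has $-y$ on the diagonal when $y\, x \cdot \theta \le 1$ and is 0 otherwise.

```python
import math


class FeatureChangeEstimator:
    """Tracks the average (clean - dirty) change per corrupted feature."""

    def __init__(self, dim):
        self.sums = [0.0] * dim
        self.counts = [0] * dim

    def observe(self, x_dirty, x_clean, corrupted):
        """Record the change of a newly cleaned record on its corrupted features."""
        for i in corrupted:
            self.sums[i] += x_clean[i] - x_dirty[i]
            self.counts[i] += 1

    def delta(self, corrupted):
        """Average change on the given corrupted features, 0 elsewhere."""
        return [self.sums[i] / self.counts[i]
                if i in corrupted and self.counts[i] > 0 else 0.0
                for i in range(len(self.sums))]


def estimated_weight(x_dirty, y, theta, delta):
    """Norm of the dirty SVM gradient plus the linear correction M_x * delta."""
    margin = y * sum(xi * ti for xi, ti in zip(x_dirty, theta))
    inside = margin <= 1
    grad = [-y * xi if inside else 0.0 for xi in x_dirty]
    m_diag = -y if inside else 0.0           # diagonal entry of M_x
    corrected = [g + m_diag * d for g, d in zip(grad, delta)]
    return math.sqrt(sum(c * c for c in corrected))
```

Dividing each record's weight by the sum of weights over all dirty records would turn these values into a sampling distribution.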

EXAMPLE 5. Consider using ActiveClean with an a priori detector. Let us assume that there are no errors in the labels and only errors in the features. Then, each training example will have a set of corrupted features (e.g., {1, 2, 6}, {1, 2, 15}). Suppose that the cleaner has just cleaned the records $r_1$ and $r_2$, represented as tuples with their corrupted feature sets: ($r_1$, {1, 2, 3}), ($r_2$, {1, 2, 6}). For each feature $i$, ActiveClean maintains the average change between dirty and clean values in a vector entry $\Delta_x[i]$, computed over those records corrupted on that feature.

Then, given a new record ($r_3$, {1, 2, 3, 6}), $\Delta_{r_3 x}$ is the vector $\Delta_x$ where component $i$ is set to 0 if the feature is not corrupted. Suppose the data analyst is using an SVM; then the $M_x$ matrix is as follows:

$$M_x[i, i] = \begin{cases} -y[i] & \text{if } y\, x \cdot \theta \le 1 \\ 0 & \text{if } y\, x \cdot \theta \ge 1 \end{cases}$$

Thus, we calculate a sampling weight for record $r_3$:

$$p(r_3) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{r_3 x}\|$$

To turn the result into a probability distribution, ActiveClean normalizes over all dirty records.

8. EXPERIMENTS

First, the experiments evaluate how various types of corrupted data benefit from data cleaning. Next, the experiments explore different prioritization and model update schemes for progressive data cleaning. Finally, ActiveClean is evaluated end-to-end in a number of real-world data cleaning scenarios.

8.1 Experimental Setup and Notation

The main metric for evaluation is a relative measure of the difference between the trained model and the model obtained if all of the data is cleaned.

Relative Model Error. Let $\theta$ be the model trained on the dirty data, and let $\theta^{*}$ be the model trained on the same data if it was cleaned. Then the model error is defined as $\frac{\|\theta - \theta^{*}\|}{\|\theta^{*}\|}$.

8.1.1 Scenarios

Income Classification (Adult): In this dataset of 45,552 records, the task is to predict the income bracket (binary) from 12 numerical and categorical covariates with an SVM classifier.

Seizure Classification (EEG): In this dataset, the task is to predict the onset of a seizure (binary) from 15 numerical covariates with a thresholded Linear Regression. There are 14980 data points in this dataset.
This classification task is inherently hard, with an accuracy on completely clean data of only 65%.

Handwriting Recognition (MNIST): In this dataset, the task is to classify 60,000 images of handwritten digits into 10 categories with a one-to-all multiclass SVM classifier. The unique part of this dataset is that the featurized data consists of a 784-dimensional vector which includes edge detectors and raw image patches.

Dollars For Docs: The dataset has 240,089 records with 5 textual attributes and one numerical attribute. The dataset is featurized with a bag-of-words featurization model for the textual attributes, which resulted in a 2021-dimensional feature vector, and a binary SVM is used to classify the status of the medical donations.

World Bank: The dataset has 193 records of country name, population, and various macro-economic statistics. The values are listed with the date at which they were acquired. This allowed us to determine that records from smaller and less populous countries were more likely to be out-of-date.

8.1.2 Compared Algorithms

Here are the alternative methodologies evaluated in the experiments:

Robust Logistic Regression [14]. Feng et al. proposed a variant of logistic regression that is robust to outliers. We chose this algorithm because it is a robust extension of the convex regularized loss model, leading to a better apples-to-apples comparison between the techniques. (See details in Appendix H.1.)

Discarding Dirty Data. As a baseline, dirty data are discarded.

SampleClean (SC) [33]. SampleClean takes a sample of data, applies data cleaning, and then trains a model to completion on the sample.

Active Learning (AL) [18]. To fairly evaluate Active Learning, we first apply our gradient update to ensure correctness. Within each iteration, examples are prioritized by distance to the decision boundary (called Uncertainty Sampling in [31]). However, we do not include our optimizations such as detection and estimation.
ActiveClean Oracle (AC+O): In ActiveClean Oracle, instead of an estimation and detection step, the true clean value is used to evaluate the theoretical ideal performance of ActiveClean.

8.2 Does Data Cleaning Matter?

The first experiment evaluates the benefits of data cleaning on two of the example datasets (EEG and Adult). Our goal is to understand which types of data corruption are amenable to data cleaning and which are better suited for robust statistical techniques. The experiment compares four schemes: (1) full data cleaning, (2) a baseline of no cleaning, (3) discarding the dirty data, and (4) robust logistic regression. We corrupted 5% of the training examples in each dataset in two different ways:

Random Corruption: Simulated high-magnitude random outliers. 5% of the examples are selected at random and a random feature is replaced with 3 times the highest feature value.

Systematic Corruption: Simulated innocuous-looking (but still incorrect) systematic corruption. The model is trained on the clean data, and the three most important features (highest weighted) are identified. The examples are sorted by each of these features and the top examples are corrupted with the mean value for that feature (5% corruption in all). It is important to note that examples can have multiple corrupted features.

Figure 4 shows the test accuracy for models trained on both types of data with the different techniques. The robust method performs well on the random high-magnitude outliers, with only a 2.0% reduction in clean test accuracy for EEG and a 2.5% reduction for Adult. In the random setting, discarding dirty data also performs relatively well. However,

the robust method falters on the systematic corruption, with a 9.1% reduction in clean test accuracy for EEG and a 10.5% reduction for Adult. The problem is that without cleaning, there is no way to know if the corruption is random or systematic and when to trust a robust method. While data cleaning requires more effort, it provides benefits in both settings. In the remaining experiments, unless otherwise noted, the experiments use systematic corruption.

[Figure 4 plots: test accuracy for EEG and Adult; panels (a) Randomly Corrupted Data and (b) Systematically Corrupted Data.]

Figure 4: (a) Robust techniques and discarding data work when corrupted data are random and look atypical. (b) Data cleaning can provide reliable performance in both the systematically corrupted setting and the randomly corrupted setting.

Summary: A 5% systematic corruption can introduce a 10% reduction in test accuracy even when using a robust method.

8.3 ActiveClean: A Priori Detection

The next set of experiments evaluates different approaches to cleaning a sample of data compared to ActiveClean using a priori detection. A priori detection assumes that all of the corrupted records are known in advance but their clean values are unknown.

8.3.1 Active Learning and SampleClean

The next experiment evaluates the samples-to-error tradeoff between four alternative algorithms: ActiveClean (AC), SampleClean, Active Learning, and ActiveClean + Oracle (AC+O). Figure 5 shows the model error and test accuracy as a function of the number of cleaned records. In terms of model error, ActiveClean gives its largest benefits for small sample sizes. For 500 cleaned records of the Adult dataset, ActiveClean has 6.1x less error than SampleClean and 2.1x less error than Active Learning. For 500 cleaned records of the EEG dataset, ActiveClean has 9.6x less error than SampleClean and 2.4x less error than Active Learning.
Both Active Learning and ActiveClean benefit from the initialization with the dirty model, as they do not retrain their models from scratch, and ActiveClean improves on this performance with detection and error estimation. Active Learning has no notion of dirty and clean data, and therefore prioritizes with respect to the dirty data. These gains in model error also correlate well with improvements in test error (defined as the test accuracy difference w.r.t. cleaning all data). The test error converges more quickly than model error, emphasizing the benefits of progressive data cleaning, since it is not necessary to clean all the data to get a model with essentially the same performance as the clean model. For example, to achieve a test error of 1% on the Adult dataset, ActiveClean cleans 500 fewer records than Active Learning.

Summary: ActiveClean with a priori detection returns results that are more than 6x more accurate than SampleClean and 2x more accurate than Active Learning for cleaning 500 records.

[Figure 5 plots: Model Error % and Test Error % versus # Records Cleaned; panels (a) Adult and (b) EEG.]

Figure 5: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to Active Learning and SampleClean.

Figure 6: -D denotes no detection, and -D-I denotes no detection and no importance sampling. Both optimizations significantly help ActiveClean outperform SampleClean and Active Learning.

8.3.2 Source of Improvements

The next experiment compares the performance of ActiveClean with and without various optimizations at the 500-records-cleaned point. ActiveClean without detection is denoted as (AC-D) (that is, at each iteration we sample from the entire dirty data), and ActiveClean without detection and importance sampling is denoted as (AC-D-I). Figure 6 plots the relative error of the alternatives and ActiveClean with and without the optimizations. Without detection (AC-D), ActiveClean is still more accurate than Active Learning.
When the importance sampling is also removed, ActiveClean is slightly worse than Active Learning on the Adult dataset but is comparable on the EEG dataset.

Summary: Both a priori detection and non-uniform sampling significantly contribute to the gains over Active Learning.

8.3.3 Mixing Dirty and Clean Data

Training a model on mixed data is an unreliable methodology, lacking the same guarantees as Active Learning or SampleClean even in the simplest of cases. For thoroughness, the next experiments include the model error as a function of records cleaned in comparison to ActiveClean. Figure 7 plots the same curves as the previous experiment, comparing ActiveClean, Active Learning, and two mixed-data algorithms. PC randomly samples data, cleans, and writes back the cleaned data. PC+D randomly samples data using the dirty data detector, cleans, and writes back the cleaned data. For these errors PC and PC+D give reasonable results


More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Support Vector Machines. CS534 - Machine Learning

Support Vector Machines. CS534 - Machine Learning Support Vector Machnes CS534 - Machne Learnng Perceptron Revsted: Lnear Separators Bnar classfcaton can be veed as the task of separatng classes n feature space: b > 0 b 0 b < 0 f() sgn( b) Lnear Separators

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

Machine Learning. Topic 6: Clustering

Machine Learning. Topic 6: Clustering Machne Learnng Topc 6: lusterng lusterng Groupng data nto (hopefully useful) sets. Thngs on the left Thngs on the rght Applcatons of lusterng Hypothess Generaton lusters mght suggest natural groups. Hypothess

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Signature and Lexicon Pruning Techniques

Signature and Lexicon Pruning Techniques Sgnature and Lexcon Prunng Technques Srnvas Palla, Hansheng Le, Venu Govndaraju Centre for Unfed Bometrcs and Sensors Unversty at Buffalo {spalla2, hle, govnd}@cedar.buffalo.edu Abstract Handwrtten word

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

EXTENDED BIC CRITERION FOR MODEL SELECTION

EXTENDED BIC CRITERION FOR MODEL SELECTION IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Improving Low Density Parity Check Codes Over the Erasure Channel. The Nelder Mead Downhill Simplex Method. Scott Stransky

Improving Low Density Parity Check Codes Over the Erasure Channel. The Nelder Mead Downhill Simplex Method. Scott Stransky Improvng Low Densty Party Check Codes Over the Erasure Channel The Nelder Mead Downhll Smplex Method Scott Stransky Programmng n conjuncton wth: Bors Cukalovc 18.413 Fnal Project Sprng 2004 Page 1 Abstract

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Machine Learning. K-means Algorithm

Machine Learning. K-means Algorithm Macne Learnng CS 6375 --- Sprng 2015 Gaussan Mture Model GMM pectaton Mamzaton M Acknowledgement: some sldes adopted from Crstoper Bsop Vncent Ng. 1 K-means Algortm Specal case of M Goal: represent a data

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005 Exercses (Part 4) Introducton to R UCLA/CCPR John Fox, February 2005 1. A challengng problem: Iterated weghted least squares (IWLS) s a standard method of fttng generalzed lnear models to data. As descrbed

More information

CSE 326: Data Structures Quicksort Comparison Sorting Bound

CSE 326: Data Structures Quicksort Comparison Sorting Bound CSE 326: Data Structures Qucksort Comparson Sortng Bound Steve Setz Wnter 2009 Qucksort Qucksort uses a dvde and conquer strategy, but does not requre the O(N) extra space that MergeSort does. Here s the

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Fusion Performance Model for Distributed Tracking and Classification

Fusion Performance Model for Distributed Tracking and Classification Fuson Performance Model for Dstrbuted rackng and Classfcaton K.C. Chang and Yng Song Dept. of SEOR, School of I&E George Mason Unversty FAIRFAX, VA kchang@gmu.edu Martn Lggns Verdan Systems Dvson, Inc.

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

K-means and Hierarchical Clustering

K-means and Hierarchical Clustering Note to other teachers and users of these sldes. Andrew would be delghted f you found ths source materal useful n gvng your own lectures. Feel free to use these sldes verbatm, or to modfy them to ft your

More information

Categories and Subject Descriptors B.7.2 [Integrated Circuits]: Design Aids Verification. General Terms Algorithms

Categories and Subject Descriptors B.7.2 [Integrated Circuits]: Design Aids Verification. General Terms Algorithms 3. Fndng Determnstc Soluton from Underdetermned Equaton: Large-Scale Performance Modelng by Least Angle Regresson Xn L ECE Department, Carnege Mellon Unversty Forbs Avenue, Pttsburgh, PA 3 xnl@ece.cmu.edu

More information

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010 Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Cost-efficient deployment of distributed software services

Cost-efficient deployment of distributed software services 1/30 Cost-effcent deployment of dstrbuted software servces csorba@tem.ntnu.no 2/30 Short ntroducton & contents Cost-effcent deployment of dstrbuted software servces Cost functons Bo-nspred decentralzed

More information