arxiv: v1 [cs.db] 15 Jan 2016

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg
UC Berkeley, Columbia University
{sanjaykrishnan, jnwang, franklin,

ABSTRACT

Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training processes due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-training and can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

1. INTRODUCTION

Machine Learning on large and growing datasets is a key data management challenge with significant interest in both industry and academia [1, 5, 1, 2]. Despite a number of breakthroughs in reducing training time, predictive modeling can still be a tedious and time-consuming task for an analyst. Data often arrive dirty, including missing, incorrect, or inconsistent attributes, and analysts widely report that data cleaning and other forms of pre-processing account for up to 80% of their effort [3, 22].
While data cleaning is an extensively studied problem, the predictive modeling setting poses a number of new challenges: (1) high dimensionality can amplify even a small amount of erroneous records [36], (2) the complexity can make it difficult to trace the consequences of an error, and (3) there are often subtle technical conditions (e.g., independent and identically distributed) that can be violated by data cleaning. Consequently, techniques that have been designed for traditional SQL analytics may be inefficient or even unreliable. In this paper, we study the relationship between data cleaning and model training workflows and explore how to apply existing data cleaning approaches with provable guarantees.

One of the main bottlenecks in data cleaning is the human effort in determining which data are dirty and then developing rules or software to correct the problems. For some types of dirty data, such as inconsistent values, model training may seemingly succeed, albeit with potential subtle inaccuracies in the model. For example, battery-powered sensors can transmit unreliable measurements when battery levels are low [21]. Similarly, data entered by humans can be susceptible to a variety of inconsistencies (e.g., typos) and unintentional cognitive biases [23]. Such problems are often addressed in a time-consuming loop where the analyst trains a model, inspects the model and its predictions, cleans some data, and re-trains.

This iterative process is the de facto standard, but without appropriate care, it can lead to several serious statistical issues. Due to the well-known Simpson's paradox, models trained on a mix of dirty and clean data can have very misleading results even in simple scenarios (Figure 1). Furthermore, if the candidate dirty records are not identified with a known sampling distribution, the statistical independence assumptions for most training methods are violated. The violations of these assumptions can introduce confounding biases. To this end, we designed ActiveClean, which trains predictive models while allowing for iterative data cleaning and has accuracy guarantees.
ActiveClean automates the dirty data identification process and the model update process, thereby abstracting these two error-prone steps away from the analyst. ActiveClean is inspired by the recent success of progressive data cleaning, where a user can gradually clean more data until the desired accuracy is reached [6, 34, 3, 17, 26, 38, 37]. We focus on a popular class of models called convex loss models (which includes linear regression and SVMs) and show that the Simpson's paradox problem can be avoided using iterative maintenance of a model rather than re-training. This process leverages the convex structure of the model rather than treating it like a black box, and we apply convergence arguments from convex optimization theory. We propose several novel optimizations that leverage information from the model to guide data cleaning towards the records most likely to be dirty and most likely to affect the results.

To summarize the contributions:

Correctness (Section 5). We show how to update a dirty model given newly cleaned data. This update converges monotonically in expectation. For a batch size $b$ and iterations $T$, it converges with rate $O(\frac{1}{\sqrt{bT}})$.

Efficiency (Section 6). We derive a theoretical optimal sampling distribution that minimizes the update error and an approximation to estimate the theoretical optimum.

Detection and Estimation (Section 7). We show how

ActiveClean can be integrated with data detection to guide data cleaning towards records expected to be dirty.

The experiments evaluate these components on four datasets with real and synthetic corruption (Section 8). Results suggest that for a fixed cleaning budget, ActiveClean returns more accurate models than uniform sampling and Active Learning when systematic corruption is sparse.

2. BACKGROUND AND PROBLEM SETUP

This section formalizes the iterative data cleaning and training process and highlights an example application.

2.1 Predictive Modeling

The user provides a relation $R$ and wishes to train a model using the data in $R$. This work focuses on a class of well-analyzed predictive analytics problems: ones that can be expressed as the minimization of convex loss functions. Convex loss minimization problems are amenable to a variety of incremental optimization methodologies with provable guarantees (see Friedman, Hastie, and Tibshirani [15] for an introduction). Examples include generalized linear models (including linear and logistic regression) and support vector machines; in fact, means and medians are also special cases.

We assume that the user provides a featurizer $F(\cdot)$ that maps every record $r \in R$ to a feature vector $x$ and label $y$. For labeled training examples $\{(x_i, y_i)\}_{i=1}^{N}$, the problem is to find a vector of model parameters $\theta$ by minimizing a loss function $\phi$ over all training examples:

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \phi(x_i, y_i, \theta)$$

where $\phi$ is a convex function in $\theta$. For example, in a linear regression $\phi$ is:

$$\phi(x_i, y_i, \theta) = \| \theta^{T} x_i - y_i \|_2^2$$

Typically, a regularization term $r(\theta)$ is added to this problem. $r(\theta)$ penalizes high or low values of feature weights in $\theta$ to avoid overfitting to noise in the training examples:

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \phi(x_i, y_i, \theta) + r(\theta) \quad (1)$$

In this work, without loss of generality, we will include the regularization as part of the loss function, i.e., $\phi(x_i, y_i, \theta)$ includes $r(\theta)$.

2.2 Data Cleaning

We consider corruption that affects the attribute values of records. This does not cover errors that simultaneously affect multiple records, such as record duplication, or structure, such as schema transformation.
Examples of supported cleaning operations include batch resolving common inconsistencies (e.g., merging "U.S.A" and "United States"), filtering outliers (e.g., removing records with values > 1e6), and standardizing attribute semantics (e.g., "1.2 miles" and "1.93 km"). We are particularly interested in those errors that are difficult or time-consuming to clean, and that require the analyst to examine an erroneous record and determine the appropriate action, possibly leveraging knowledge of the current best model. We represent this operation as $Clean(\cdot)$, which can be applied to a record $r$ (or a set of records) to recover the clean record $r' = Clean(r)$.

Formally, we treat $Clean(\cdot)$ as an expensive user-defined function composed of deterministic schema-preserving map and filter operations applied to a subset of rows in the relation. A relation is defined as clean if $R_{clean} = Clean(R_{clean})$. Therefore, for every $r' \in R_{clean}$ there exists a unique $r \in R$ in the dirty data. The map and filter cleaning model is not a fundamental restriction of ActiveClean, and Appendix A discusses a compatible "set of records" cleaning model.

2.3 Iteration

As an example of how $Clean(\cdot)$ fits into an iterative analysis process, consider an analyst training a regression and identifying outliers. When she examines one of the outliers, she realizes that the base data (prior to featurization) has a formatting inconsistency that leads to incorrect parsing of the numerical values. She applies a batch fix (i.e., $Clean(\cdot)$) to all of the outliers with the same error, and re-trains the model. This iterative process can be described as the following pseudocode loop:

1. Init(iter)
2. current_model = Train(R)
3. For each t in {1, ..., iter}
   (a) dirty_sample = Identify(R, current_model)
   (b) clean_sample = Clean(dirty_sample)
   (c) current_model = Update(clean_sample, R)
4. Output: current_model

While we have already discussed $Train(\cdot)$ and $Clean(\cdot)$, the analyst still has to define the primitives $Identify(\cdot)$ and $Update(\cdot)$.
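As an illustration only, the pseudocode loop can be sketched in Python with the most straightforward choices for the two undefined primitives. Everything here is an assumption made for the sketch and not part of ActiveClean: `train` is ordinary least squares, `identify` picks the records with the largest residuals, `Update` is plain re-training on the partially cleaned data, and `clean` is a caller-supplied stand-in for the analyst's cleaning function.

```python
import numpy as np

def train(X, y):
    """Fit ordinary least squares and return the parameter vector theta."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def identify(X, y, theta, k):
    """Hypothetical criterion: indices of the k records with the largest residuals."""
    residuals = np.abs(X @ theta - y)
    return np.argsort(residuals)[-k:]

def iterative_cleaning(X_dirty, y_dirty, clean, iterations, k):
    """Naive loop: clean k suspicious records in place, then re-train on the mixture."""
    X, y = X_dirty.copy(), y_dirty.copy()
    model = train(X, y)
    for _ in range(iterations):
        idx = identify(X, y, model, k)
        X[idx], y[idx] = clean(X[idx], y[idx])  # analyst-supplied Clean()
        model = train(X, y)  # Update() implemented as full re-training
    return model
```

This naive version is useful mainly as a baseline: it re-trains on a mixture of dirty and clean records, which is the setting whose pitfalls are discussed under Challenges.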
For $Identify(\cdot)$, given the current best model, the analyst must specify some criteria to select a set of records to examine. In $Update(\cdot)$, the analyst must decide how to update the model given newly cleaned data. It turns out that these primitives are not trivial to implement, since the straightforward solutions can actually lead to divergence of the trained models.

2.4 Challenges

Correctness: Let us assume that the analyst has implemented an $Identify(\cdot)$ function that returns $k$ candidate dirty records. The straightforward application of data cleaning is to repair the corruption in place, and re-train the model after each repair. Suppose $k \ll N$ records are cleaned, but all of the remaining dirty records are retained in the dataset. Figure 1 highlights the dangers of this approach on a very simple dirty dataset and a linear regression model, i.e., the best fit line for two variables. One of the variables is systematically corrupted with a translation in the x-axis (Figure 1a). The dirty data is marked in red and the clean data in blue, and they are shown with their respective best fit lines. After cleaning only two of the data points (Figure 1b), the resulting best fit line is in the opposite direction of the true model.

Aggregates over mixtures of different populations of data can result in spurious relationships due to the well-known phenomenon called Simpson's paradox [32]. Simpson's paradox is by no means a corner case, and it has affected the validity of a number of high-profile studies [35], even in the simple case of taking an average over a dataset. Predictive models are high-dimensional generalizations of these aggregates without closed form techniques to compensate for these

biases. Thus, training models on a mixture of dirty and clean data can lead to unreliable results, where artificial trends introduced by the mixture can be confused for the effects of data cleaning.

Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly inaccurate models.

An alternative is to avoid the dirty data altogether instead of mixing the two populations, so that model re-training is restricted to only data that are known to be clean. This approach is similar to SampleClean [33], which was proposed to approximate the results of aggregate queries by applying them to a clean sample of data. However, high-dimensional models are highly sensitive to sample size. Figure 1c illustrates that, even in two dimensions, models trained from small samples can be as incorrect as the mixing solution described before.

Efficiency: Conversely, hypothetically assume that the analyst has implemented a correct $Update(\cdot)$ primitive and implements $Identify(\cdot)$ with a technique such as Active Learning to select records to clean [37, 38, 16]. Active learning is a technique to carefully select the set of examples to learn the most accurate model. However, these selection criteria are designed for stationary data distributions, an assumption which is not true in this setting. As more data are cleaned, the data distribution changes. Data which may look unimportant in the dirty data might be very valuable to clean in reality, and thus any prioritization has to predict a record's value with respect to an anticipated clean model.

2.5 The Need For Automation

ActiveClean is a framework that implements the $Identify(\cdot)$ and $Update(\cdot)$ primitives for the analyst. By automating the iterative process, ActiveClean ensures reliable models with convergence guarantees. The analyst first initializes ActiveClean with a dirty model. ActiveClean carefully selects small batches of data to clean based on data that are likely to be dirty and likely to affect the model.
The analyst applies data cleaning to these batches, and ActiveClean updates the model with an incremental optimization technique.

Machine learning has been applied in prior work to improve the efficiency of data cleaning [37, 38, 16]. Human input, either for cleaning or validation of automated cleaning, is often expensive and impractical for large datasets. A model can learn rules from a small set of examples cleaned (or validated) by a human, and active learning is a technique to carefully select the set of examples to learn the most accurate model. This model can be used to extrapolate repairs to not-yet-cleaned data, and the goal of these approaches is to provide the cleanest possible dataset independent of the subsequent analytics or query processing. These approaches, while very effective, suffer from composability problems when placed inside cleaning and training loops. To summarize, ActiveClean considers data cleaning during model training, while these techniques consider model training for data cleaning. One of the primary contributions of this work is an incremental model update algorithm with correctness guarantees for mixtures of data.

2.6 Use Case: Dollars for Docs [2]

ProPublica collected a dataset of corporate donations to doctors to analyze conflicts of interest. They reported that some doctors received over $500,000 in travel, meals, and consultation expenses [4]. ProPublica laboriously curated and cleaned a dataset from the Centers for Medicare and Medicaid Services that listed nearly 250,000 research donations, and aggregated these donations by physician, drug, and pharmaceutical company. We collected the raw unaggregated data and explored whether suspect donations could be predicted with a model. This problem is typical of analysis scenarios based on observational data seen in finance, insurance, medicine, and investigative journalism.
The dataset has the following schema:

Contribution(p_specialty, drug_name, device_name, corporation, amount, dispute, status)

p_specialty is a textual attribute describing the specialty of the doctor receiving the donation.
drug_name is the branded name of the drug in the research study (null if not a drug).
device_name is the branded name of the device in the study (null if not a device).
corporation is the name of the pharmaceutical company providing the donation.
amount is a numerical attribute representing the donation amount.
dispute is a Boolean attribute describing whether the research was disputed.
status is a string label describing whether the donation was allowed under the declared research protocol. The goal is to predict disallowed donations.

However, this dataset is very dirty, and the systematic nature of the data corruption can result in an inaccurate model. On the ProPublica website [2], they list numerous types of data problems that had to be cleaned before publishing the data (see Appendix I). For example, the most significant donations were made by large companies whose names were also more often inconsistently represented in the data, e.g., "Pfizer Inc.", "Pfizer Incorporated", "Pfizer". In such scenarios, the effect of systematic error can be serious. Duplicate representations could artificially reduce the correlation between these entities and suspected contributions. There were nearly 40,000 of the 250,000 records that had either naming inconsistencies or other inconsistencies in labeling the allowed or disallowed status. Without data cleaning, the detection rate using a Support Vector Machine was 66%. Applying the data cleaning to the entire dataset improved this rate to 97% in the clean data (Section 8.6.1), and the experiments describe how ActiveClean can achieve an 80% detection rate for less than 1.6% of the records cleaned.

3. PROBLEM FORMALIZATION

This section formalizes the problems addressed in the paper.

3.1 Notation and Setup

The user provides a relation $R$, a cleaner $C(\cdot)$, a featurizer $F(\cdot)$, and a convex loss problem defined by the loss $\phi(\cdot)$. A total of $k$ records will be cleaned in batches of size $b$, so there will be $\frac{k}{b}$ iterations. We use the following notation to represent relevant intermediate states:

Dirty Model: $\theta^{(d)}$ is the model trained on $R$ (without cleaning) with the featurizer $F(\cdot)$ and loss $\phi(\cdot)$. This serves as an initialization to ActiveClean.
Dirty Records: $R_{dirty} \subseteq R$ is the subset of records that are still dirty. As more data are cleaned, $R_{dirty} \to \{\}$.
Clean Records: $R_{clean} \subseteq R$ is the subset of records that are clean, i.e., the complement of $R_{dirty}$.
Samples: $S$ is a sample (possibly non-uniform but with known probabilities) of the records $R_{dirty}$. The clean sample is denoted by $S_{clean} = C(S)$.
Clean Model: $\theta^{(c)}$ is the optimal clean model, i.e., the model trained on a fully cleaned relation.
Current Model: $\theta^{(t)}$ is the current best model at iteration $t \in \{1, ..., \frac{k}{b}\}$, and $\theta^{(0)} = \theta^{(d)}$.

There are two metrics that we will use to measure the performance of ActiveClean:

Model Error. The model error is defined as $\| \theta^{(t)} - \theta^{(c)} \|$.
Testing Error. Let $T(\theta^{(t)})$ be the out-of-sample testing error when the current best model is applied to the clean data, and $T(\theta^{(c)})$ be the test error when the clean model is applied to the clean data. The testing error is defined as $T(\theta^{(t)}) - T(\theta^{(c)})$.

3.2 Problem 1. Correct Update Problem

Given newly cleaned data $S_{clean}$ and the current best model $\theta^{(t)}$, the model update problem is to calculate $\theta^{(t+1)}$. $\theta^{(t+1)}$ will have some error with respect to the true model $\theta^{(c)}$, which we denote as:

$$error(\theta^{(t+1)}) = \| \theta^{(t+1)} - \theta^{(c)} \|$$

Since a sample of data are cleaned, it is only meaningful to talk about expected errors. We call the update algorithm "reliable" if the expected error is upper bounded by a monotonically decreasing function $\mu$ of the amount of cleaned data:

$$E(error(\theta^{new})) = O(\mu(|S_{clean}|))$$

Intuitively, "reliable" means that more cleaning should imply more accuracy.
The Correct Update Problem is to reliably update the model $\theta^{(t)}$ with a sample of cleaned data.

3.3 Problem 2. Efficiency Problem

The efficiency problem is to select $S_{clean}$ such that the expected error $E(error(\theta^{(t)}))$ is minimized. ActiveClean uses previously cleaned data to estimate the value of data cleaning on new records. Then it draws a sample of records $S \subseteq R_{dirty}$. This is a non-uniform sample where each record $r$ has a sampling probability $p(r)$ based on the estimates. We derive the optimal sampling distribution for the SGD updates, and show how the theoretical optimum can be approximated.

The Efficiency Problem is to select a sampling distribution $p(\cdot)$ over all records such that the expected error w.r.t. the model if trained on fully clean data is minimized.

4. ARCHITECTURE

This section presents the ActiveClean architecture.

4.1 Overview

Figure 2 illustrates the ActiveClean architecture. The dotted boxes describe optional components that the user can provide to improve the efficiency of the system.

Required User Input

Model: The user provides a predictive model (e.g., SVM) specified as a convex loss optimization problem $\phi(\cdot)$ and a featurizer $F(\cdot)$ that maps a record to its feature vector $x$ and label $y$.

Cleaning Function: The user provides a function $C(\cdot)$ (implemented via software or crowdsourcing) that maps dirty records to clean records as per our definition in Section ??.

Batches: Data are cleaned in batches of size $b$, and the user can change this setting if she desires more or less frequent model updates. The choice of $b$ does affect the convergence rate. Section 5 discusses the efficiency and convergence trade-offs of different values of $b$. We empirically find that a batch size of 50 performs well across different datasets and use that as a default. A cleaning budget $k$ can be used as a stopping criterion once $C(\cdot)$ has been called $k$ times, and so the number of iterations of ActiveClean is $T = \frac{k}{b}$.
Alternatively, the user can clean data until the model is of sufficient accuracy to make a decision.

Basic Data Flow

The system first trains the model $\phi(\cdot)$ on the dirty dataset to find an initial model $\theta^{(d)}$ that the system will subsequently improve. The sampler selects a sample of $b$ records from the dataset and passes the sample to the cleaner, which executes $C(\cdot)$ for each sample record and outputs their cleaned versions. The updater uses the cleaned sample to update the weights of the model, thus moving the model closer to the true cleaned model (in expectation). Finally, the system either terminates due to a stopping condition (e.g., $C(\cdot)$ has been called a maximum number of times $k$, or training error convergence), or passes control to the sampler for the next iteration.

Optimizations

In many cases, such as missing values, errors can be efficiently detected. A user-provided Detector can be used to identify records that are more likely to be dirty, and thus improves the likelihood that the next sample will contain true dirty records. Furthermore, the Estimator uses previously cleaned data to estimate the effect that cleaning a given record will have on the model. These components can be used separately (if only one is supplied) or together to focus the system's cleaning efforts on records that will most improve the model. Section 7 describes several instantiations of these components for different data cleaning problems. Our experiments show that these optimizations can improve model accuracy by up to 2.5x (Section 8.3.2).

4.2 Example

The following example illustrates how a user would apply ActiveClean to address the use case in Section 2.6:

EXAMPLE 1. The analyst chooses to use an SVM model, and manually cleans records by hand (the $C(\cdot)$). ActiveClean initially selects a sample of 50 records (the default) to show the analyst. She identifies a subset of 15 records that are dirty, fixes them by normalizing the drug and corporation names with the

help of a search engine, and corrects the labels with typographical or incorrect values. The system then uses the cleaned records to update the current best model and select the next sample of 50. The analyst can stop at any time and use the improved model to predict donation likelihoods.

Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data. The framework adaptively selects the best data to clean and can optionally (denoted with dotted lines) integrate with predefined detection rules and estimation algorithms for improved convergence.

5. UPDATES WITH CORRECTNESS

This section describes an algorithm for reliable model updates. The updater assumes that it is given a sample of data $S_{dirty}$ from $R_{dirty}$, where each $i \in S_{dirty}$ has a known sampling probability $p(i)$. Sections 6 and 7 show how to optimize $p(\cdot)$, and the analysis in this section applies for any sampling distribution $p(\cdot) > 0$.

5.1 Geometric Derivation

The update algorithm intuitively follows from the convex geometry of the problem. Consider the problem in one dimension (i.e., the parameter $\theta$ is a scalar value), so that the goal is to find the minimum point ($\theta^{*}$) of a curve $l(\theta)$. The consequence of dirty data is that the wrong loss function is optimized. Figure 3A illustrates the consequence of the optimization. The red dotted line shows the loss function on the dirty data. Optimizing the loss function finds $\theta^{(d)}$ at the minimum point (red star). However, the true loss function (w.r.t. the clean data) is in blue, thus the optimal value on the dirty data is in fact a suboptimal point on the clean curve (red circle).

Figure 3: (A) A model trained on dirty data can be thought of as a sub-optimal point w.r.t. the clean data. (B) The gradient gives us the direction to move the suboptimal model to approach the true optimum.

The optimal clean model $\theta^{(c)}$ is visualized as a yellow star. The first question is which direction to update $\theta^{(d)}$ (i.e., left or right). For this class of models, given a suboptimal point, the direction to the global optimum is the gradient of the loss function. The gradient is a $d$-dimensional vector function of the current model $\theta^{(d)}$ and the clean data. Therefore, ActiveClean needs to update $\theta^{(d)}$ some distance $\gamma$ (Figure 3B):

$$\theta^{new} \leftarrow \theta^{(d)} - \gamma \cdot \nabla\phi(\theta^{(d)})$$

At the optimal point, the magnitude of the gradient will be zero. So intuitively, this approach iteratively moves the model downhill (transparent red circle), correcting the dirty model until the desired accuracy is reached. However, the gradient depends on all of the clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. The main intuition is that if the gradient steps are on average correct, the model still moves downhill, albeit with a reduced convergence rate proportional to the inaccuracy of the sample-based estimate.

To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current best model $\theta$, the true gradient $g^{*}(\theta)$ is:

$$g^{*}(\theta) = \nabla\phi(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta)$$

ActiveClean needs to estimate $g^{*}(\theta)$ from a sample $S$, which is drawn from the dirty data $R_{dirty}$. Therefore, the sum has two components: the gradient on the already clean data, $g_C$, which can be computed without cleaning, and $g_S$, the gradient estimate from a sample of dirty data to be cleaned:

$$g(\theta) = \frac{|R_{clean}|}{|R|} \cdot g_C(\theta) + \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta)$$

$g_C$ can be calculated by applying the gradient to all of the already cleaned records:

$$g_C(\theta) = \frac{1}{|R_{clean}|} \sum_{i \in R_{clean}} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta)$$

$g_S$ can be estimated from a sample by taking the gradient w.r.t. each record, and re-weighting the average by their respective sampling probabilities. Before taking the gradient, the cleaning function $C(\cdot)$ is applied to each sampled record. Therefore, let $S$ be a sample of data, where each $i \in S$ is drawn with probability $p(i)$:

$$g_S(\theta) = \frac{1}{|S|} \sum_{i \in S} \frac{1}{p(i)} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta)$$

Then, at each iteration $t$, the update becomes:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot g(\theta^{(t)})$$

5.2 Model Update Algorithm

To summarize, the algorithm is initialized with $\theta^{(0)} = \theta^{(d)}$, which is the dirty model.
There are three user-set parameters: the budget $k$, the batch size $b$, and the step size $\gamma$. In the following section, we will provide references from the convex optimization literature that allow the user to appropriately select these values. At each iteration $t = \{1, ..., T\}$, the cleaning is applied to a batch of data $b$ selected from the set of candidate dirty records $R_{dirty}$. Then, an average gradient is estimated from the cleaned batch and the model is updated. Iterations continue until $k = T \cdot b$ records are cleaned.

1. Calculate the gradient over the sample of newly clean data and call the result $g_S(\theta^{(t)})$.
2. Calculate the average gradient over all of the already clean data in $R_{clean} = R - R_{dirty}$, and call the result $g_C(\theta^{(t)})$.
3. Apply the following update rule:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot \left( \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta^{(t)}) + \frac{|R_{clean}|}{|R|} \cdot g_C(\theta^{(t)}) \right)$$

5.3 Analysis with Stochastic Gradient Descent

The update algorithm can be formalized as a member of a class of very well studied algorithms called Stochastic Gradient Descent (SGD). SGD provides a theoretical framework to understand and analyze the update rule and bound the error. Mini-batch SGD is an algorithm for finding the optimal value given the convex loss and data. In mini-batch SGD, random subsets of data are selected at each iteration and the average gradient is computed for every batch. One key difference from traditional SGD models is that ActiveClean applies a full gradient step on the already clean data and averages it with a stochastic gradient step (i.e., calculated from a sample) on the dirty data. Therefore, ActiveClean iterations can take multiple passes over the clean data but at most a single cleaning pass of the dirty data. The update algorithm can be thought of as a variant of SGD that lazily materializes the clean value. As data is sampled at each iteration, data is cleaned when needed by the optimization. It is well known that even for an arbitrary initialization SGD makes significant progress in less than one epoch (a pass through the entire dataset) [9]. In practice, the dirty model can be much more accurate than an arbitrary initialization, as corruption may only affect a few features, and combined with the full gradient step on the clean data the updates converge very quickly.

Setting the step size $\gamma$: There is extensive literature in machine learning on choosing the step size $\gamma$ appropriately. $\gamma$ can be set either to be a constant or decayed over time. Many machine learning frameworks (e.g., MLlib, scikit-learn, Vowpal Wabbit) automatically set learning rates or provide different learning scheduling frameworks.
In the experiments, we use a technique called inverse scaling, where there is a parameter $\gamma_0 = 0.1$, and at each iteration it decays to $\gamma_t = \frac{\gamma_0}{|S| t}$.

Setting the batch size $b$: The batch size should be set by the user to have the desired properties. Larger batches will take longer to clean and will make more progress towards the clean model but will have less frequent model updates. On the other hand, smaller batches are cleaned faster and have more frequent model updates. There are diminishing returns to increasing the batch size, $O(\frac{1}{\sqrt{b}})$. In the experiments, we use a batch size of 50, which converges fast but allows for frequent model updates. If a data cleaning technique requires a larger batch size than 50, i.e., data cleaning is fast enough that the iteration overhead is significant compared to cleaning 50 records, ActiveClean can apply the updates in smaller batches. For example, the batch size set by the user might be $b = 1000$, but the model updates after every 50 records are cleaned. We can disassociate the batching requirements of SGD and the batching requirements of the data cleaning technique.

Convergence Conditions and Properties

Convergence properties of batch SGD formulations have been well studied [11]. Essentially, if the gradient estimate is unbiased and the step size is appropriately chosen, the algorithm is guaranteed to converge. In Appendix B, we show that the gradient estimate from ActiveClean is indeed unbiased and our choice of step size is one that is established to converge. The convergence rates of SGD are also well analyzed [11, 8, 39]. The analysis gives a bound on the error of intermediate models and the expected number of steps before achieving a model within a certain error. For a general convex loss, a batch size $b$, and $T$ iterations, the convergence rate is bounded by $O(\frac{\sigma^2}{\sqrt{bT}})$. $\sigma^2$ is the variance in the estimate of the gradient at each iteration:

$$E(\| g - g^{*} \|^2)$$

where $g^{*}$ is the gradient computed over the full data if it were fully cleaned.
This property of SGD allows us to bound the model error with a monotonically decreasing function of the number of records cleaned, thus satisfying the reliability condition in the problem statement. If the loss is non-convex, the update procedure will converge towards a local minimum rather than the global minimum (see Appendix C).

5.4 Example

This example describes an application of the update algorithm.

EXAMPLE 2. Recall that the analyst has a dirty SVM model $\theta^{(d)}$ trained on the dirty data. She decides that she has a budget of cleaning 100 records, and decides to clean the 100 records in batches of 10 (set based on how fast she can clean the data, and how often she wants to see an updated result). All of the data is initially treated as dirty, with $R_{dirty} = R$ and $R_{clean} = \emptyset$. The gradient of a basic SVM is given by the following function:

$$\nabla\phi(x, y, \theta) = \begin{cases} -y \cdot x & \text{if } y \, x \cdot \theta < 1 \\ 0 & \text{if } y \, x \cdot \theta \ge 1 \end{cases}$$

For each iteration $t$, a sample of 10 records $S$ is drawn from $R_{dirty}$. ActiveClean then applies the cleaning function to the sample:

$$\{(x_i^{(c)}, y_i^{(c)})\} = \{C(i) : \forall i \in S\}$$

Using these values, ActiveClean estimates the gradient on the newly cleaned data:

$$g_S(\theta) = \frac{1}{|S|} \sum_{i \in S} \frac{1}{p(i)} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta)$$

ActiveClean also applies the gradient to the already clean data (initially non-existent):

$$g_C(\theta) = \frac{1}{|R_{clean}|} \sum_{i \in R_{clean}} \nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta)$$

Then, it calculates the update rule:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma \cdot \left( \frac{|R_{dirty}|}{|R|} \cdot g_S(\theta^{(t)}) + \frac{|R_{clean}|}{|R|} \cdot g_C(\theta^{(t)}) \right)$$

Finally, $R_{dirty} \leftarrow R_{dirty} - S$, $R_{clean} \leftarrow R_{clean} + S$, and ActiveClean continues to the next iteration.
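A single iteration of this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `hinge_gradient` and `activeclean_step` are hypothetical, the cleaned batch is assumed to be already featurized, and the importance weights are normalized as $1 / (|R_{dirty}| \cdot p(i))$ with $p$ summing to one over the dirty records, a convention chosen here so that the sample term estimates the mean gradient over the dirty partition.

```python
import numpy as np

def hinge_gradient(X, y, theta):
    """Per-record hinge-loss (sub)gradient: -y*x where y * (x . theta) < 1, else 0."""
    active = (y * (X @ theta)) < 1
    return -(y * active)[:, None] * X

def activeclean_step(theta, X_clean, y_clean, X_batch, y_batch, p_batch, n_dirty, gamma):
    """One update combining a full gradient on clean data with a weighted sample gradient.

    X_batch, y_batch: the freshly cleaned sample drawn from the dirty records.
    p_batch: probability with which each sampled record was drawn
             (assumed to sum to 1 over all dirty records).
    n_dirty: number of dirty records before this batch was drawn.
    """
    n_clean = len(y_clean)
    n_total = n_clean + n_dirty
    # Importance-weighted estimate of the mean gradient over the dirty partition
    # (normalization is an assumption of this sketch).
    weights = 1.0 / (n_dirty * p_batch)
    g_s = (weights[:, None] * hinge_gradient(X_batch, y_batch, theta)).mean(axis=0)
    # Exact mean gradient over the records that are already clean.
    if n_clean:
        g_c = hinge_gradient(X_clean, y_clean, theta).mean(axis=0)
    else:
        g_c = np.zeros_like(theta)
    g = (n_dirty / n_total) * g_s + (n_clean / n_total) * g_c
    return theta - gamma * g
```

After each call, the caller would move the sampled records from the dirty set to the clean set and decay `gamma` according to the chosen schedule.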

6. EFFICIENCY WITH SAMPLING

The updater received a sample with probabilities $p(\cdot)$. For any distribution where $p(\cdot) > 0$, we can preserve correctness. ActiveClean uses a sampling algorithm that selects the most valuable records to clean with higher probability.

6.1 Oracle Sampling Problem

Recall that the convergence rate of an SGD algorithm is bounded by $\sigma^2$, which is the variance of the gradient. Intuitively, the variance measures how accurately the gradient is estimated from a uniform sample. Other sampling distributions, while preserving the sample expected value, may have a lower variance. Thus, the oracle sampling problem is defined as a search over sampling distributions to find the minimum variance sampling distribution.

DEFINITION 1 (ORACLE SAMPLING PROBLEM). Given a set of candidate dirty data $R_{dirty}$, $\forall r \in R_{dirty}$ find sampling probabilities $p(r)$ such that over all samples S of size k it minimizes:

$E(\|g_S - g^*\|^2)$

It can be shown that the optimal distribution over records in $R_{dirty}$ has probabilities proportional to:

$p_i \propto \|\nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta^{(t)})\|$

This is an established result; for thoroughness, we provide a proof in the appendix (Section D). Intuitively, records with higher gradients should be sampled with higher probability as they affect the update more significantly. However, ActiveClean cannot exclude records with lower gradients, as that would induce a bias hurting convergence. The optimal distribution also leads to a chicken-and-egg problem: the optimal sampling distribution requires knowing $(x_i^{(c)}, y_i^{(c)})$, but cleaning is required to know those values.

6.2 Dirty Gradient Solution

Such an oracle does not exist, and one solution is to use the gradient with respect to the dirty data:

$p_i \propto \|\nabla\phi(x_i^{(d)}, y_i^{(d)}, \theta^{(t)})\|$

It turns out that this solution works reasonably well in practice on our experimental datasets, and it has been studied in Machine Learning as the Expected Gradient Length heuristic [31]. The contribution in this work is integrating this heuristic with statistically correct updates. However, intuitively, approximating the oracle as closely as possible can result in improved prioritization.
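The dirty-gradient heuristic of Section 6.2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the function names are hypothetical, a least-squares gradient stands in for the user's model, and the small `floor` constant is an assumption added here so that every record keeps a nonzero probability, as correctness requires.

```python
import math
import random


def linreg_gradient(x, y, theta):
    """Least-squares gradient (theta . x - y) * x for a single record."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


def sampling_probabilities(dirty, theta, gradient_fn, floor=1e-6):
    """Probabilities proportional to the gradient norm on the dirty values.

    `floor` (an assumption of this sketch) keeps p(i) > 0 for every record,
    so low-gradient records are down-weighted but never excluded.
    """
    norms = []
    for x, y in dirty:
        g = gradient_fn(x, y, theta)
        norms.append(math.sqrt(sum(gi * gi for gi in g)) + floor)
    total = sum(norms)
    return [n / total for n in norms]


def draw_sample(dirty, probs, k, rng):
    """Draw k record indices (with replacement) and keep each p(i)."""
    indices = rng.choices(range(len(dirty)), weights=probs, k=k)
    return [(i, probs[i]) for i in indices]
```

Returning each index together with its probability lets the updater apply the $1/p(i)$ weighting, which is what keeps the gradient estimate unbiased under non-uniform sampling.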
The subsequent section describes two components, the detector and estimator, that can be used to improve the convergence rate. Our experiments suggest up to a 2x improvement in convergence when using these optional optimizations (Section 8.3.2).

7. OPTIMIZATIONS

In this section, we describe two approaches to optimization, the Detector and the Estimator, that improve the efficiency of the cleaning process. Both approaches are designed to increase the likelihood that the Sampler will pick dirty records that, once cleaned, most move the model towards the true clean model. The Detector is intended to learn the characteristics that distinguish dirty records from clean records, while the Estimator is designed to estimate the amount that cleaning a given dirty record will move the model towards the true optimal model.

7.1 The Detector

The detector returns two important aspects of a record: (1) whether the record is dirty, and (2) if it is dirty, what is wrong with the record. The sampler can use (1) to select a subset of dirty records to sample at each batch, and the estimator can use (2) to estimate the value of data cleaning based on other records with the same corruption. ActiveClean supports two types of detectors: a priori and adaptive. The former assumes that we know the set of dirty records and how they are dirty before running ActiveClean, while the latter adaptively learns characteristics of the dirty data as part of running ActiveClean.

A Priori Detector

For many types of dirtiness, such as missing attribute values and constraint violations, it is possible to efficiently enumerate a set of corrupted records and determine how the records are corrupted.

DEFINITION 2 (A PRIORI DETECTION). Let r be a record in R. An a priori detector is a detector that returns a Boolean of whether the record is dirty and a set of columns $e_r$ that are dirty:

$D(r) = (\{0, 1\}, e_r)$

From the set of columns that are dirty, find the corresponding features that are dirty, $f_r$, and labels that are dirty, $l_r$.

Here is an example of this definition using a data cleaning methodology proposed in the literature.
Constraint-based Repair: One model for detecting errors involves declaring constraints on the database.

Detection. Let $\Sigma$ be a set of constraints on the relation R. In the detection step, the detector selects a subset of records $R_{dirty} \subseteq R$ that violate at least one constraint. The set $e_r$ is the set of columns for each record which have a constraint violation.

EXAMPLE 3. An example of a constraint on the running example dataset is that the status of a contribution can be only "allowed" or "disallowed". Any other value for status is an error.

7.2 Adaptive Detection

A priori detection is not possible in all cases. The detector also supports adaptive detection, where detection is learned from previously cleaned data. Note that this "learning" is distinct from the "learning" at the end of the pipeline. The challenge in formulating this problem is that the detector needs to describe how the data is dirty (e.g., $e_r$ in the a priori case). The detector achieves this by categorizing the corruption into u classes. These classes are corruption categories that do not necessarily align with features, but every record is classified with at most one category.

When using adaptive detection, the repair step has to clean the data and report to which of the u classes the corrupted record belongs. When an example (x, y) is cleaned, the repair step labels it with one of the clean, 1, 2, ..., u + 1 classes (including one for "not dirty"). It is possible that u increases each iteration as more types of dirtiness are discovered. In many real world datasets, data errors have locality, where similar records tend to be similarly corrupted. There are usually a small number of error classes even if a large number of records are corrupted.

One approach for adaptive detection is using a statistical classifier. This approach is particularly suited for a small number of data error classes, each of which contains many erroneous records. This problem can be addressed by any classifier, and we use an all-versus-one SVM in our experiments.

Another approach could be to adaptively learn predicates that define each of the error classes. For example, if records with certain attributes are corrupted, a pattern tableau can be assigned to each class to select a set of possibly corrupted records. This approach is better suited than a statistical approach for a large number of error classes or scarcity of errors. However, it relies on errors being well aligned with certain attribute values.

DEFINITION 3 (ADAPTIVE CASE). Select the set of records for which $\kappa$ gives a positive error classification (i.e., one of the u error classes). After each sample of data is cleaned, the classifier $\kappa$ is retrained. So the result is:

$D(r) = (\{1, 0\}, \{1, \ldots, u + 1\})$

Adaptive Detection With OpenRefine:

EXAMPLE 4. OpenRefine is a spreadsheet-based tool that allows users to explore and transform data. However, it is limited to cleaning data that can fit in memory on a single computer. Since the cleaning operations are coupled with data exploration, ActiveClean does not know what is dirty in advance (the analyst may discover new errors as she cleans). Suppose the analyst wants to use OpenRefine to clean the running example dataset with ActiveClean. She takes a sample of data from the entire dataset and uses the tool to discover errors. For example, she finds that some drugs are incorrectly classified as both drugs and devices. She then removes the device attribute for all records that have the drug name in question. As she fixes the records, she tags each one with a category tag indicating which corruption it belongs to.
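The adaptive detector of Definition 3 can be sketched as a classifier that is refit whenever newly cleaned, tagged records arrive. The sketch below is an illustration only: the class name `AdaptiveDetector` and its methods are hypothetical, and a nearest-centroid rule is substituted for the all-versus-one SVM used in the experiments so that the example has no dependencies. Label 0 is assumed to mean "not dirty" and labels 1, 2, ... are the error classes reported by the repair step.

```python
import math


class AdaptiveDetector:
    """Nearest-centroid stand-in for the classifier kappa (an assumption)."""

    CLEAN = 0  # label the repair step assigns to records that were not dirty

    def __init__(self):
        self.examples = {}  # class label -> list of feature vectors

    def add_cleaned(self, features, error_class):
        """Record one cleaned example together with its category tag."""
        self.examples.setdefault(error_class, []).append(list(features))

    def _centroids(self):
        centroids = {}
        for label, rows in self.examples.items():
            dim = len(rows[0])
            centroids[label] = [sum(r[j] for r in rows) / len(rows)
                                for j in range(dim)]
        return centroids

    def detect(self, features):
        """Return (is_dirty, error_class), mirroring D(r)."""
        centroids = self._centroids()  # recomputed, i.e. "retrained", per call
        if not centroids:
            # Nothing cleaned yet: treat every record as possibly dirty.
            return (1, None)
        label = min(centroids,
                    key=lambda c: math.dist(features, centroids[c]))
        if label == self.CLEAN:
            return (0, self.CLEAN)
        return (1, label)
```

In the OpenRefine scenario of Example 4, each record the analyst fixes would be passed to `add_cleaned` with its category tag, and `detect` would then narrow the set of records the sampler draws from.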
7.3 The Estimator

To get around the problem with oracle sampling, the estimator will estimate the cleaned value with previously cleaned data. The estimator will also take advantage of the detector from the previous section. There are a number of different approaches, such as regression, that could be used to estimate the cleaned value given the dirty values. However, there is a problem of scarcity, where errors may affect a small number of records. As a result, the regression approach would have to learn a multivariate function with only a few examples. Thus, high-dimensional regression is ill-suited for the estimator. Conversely, one could try a very simple estimator that just calculates an average change and adds this change to all of the gradients. This estimator can be highly inaccurate, as it also applies the change to records that are known to be clean.

ActiveClean leverages the detector for an estimator between these two extremes. The estimator calculates average changes feature-by-feature and selectively corrects the gradient when a feature is known to be corrupted based on the detector. It also applies a linearization that leads to improved estimates when the sample size is small. We evaluate the linearization in Section 8.5 against alternatives, and find that it provides more accurate estimates for a small number of samples cleaned. The result is a biased estimator, and when the number of cleaned samples is large the alternative techniques are comparable or even slightly better due to the bias.

Estimation For A Priori Detection. If most of the features are correct, it would seem like the gradient is only incorrect in one or two of its components. The problem is that the gradient $\nabla\phi(\cdot)$ can be a very nonlinear function of the features that couples features together. For example, the gradient for linear regression is:

$\nabla\phi(x, y, \theta) = (\theta^T x - y)x$

It is not possible to isolate the effect of a change of one feature on the gradient. Even if only one of the features is corrupted, all of the gradient components will be incorrect.
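A small numeric check makes the coupling concrete. The values below are made up for illustration: only the third feature of the record is corrupted, yet every component of the linear regression gradient changes, because the residual $\theta^T x - y$ multiplies all features.

```python
def linreg_gradient(x, y, theta):
    """Linear regression gradient (theta . x - y) * x for one record."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


theta = [1.0, 2.0, -1.0]      # illustrative model parameters
y = 1.0
clean_x = [1.0, 1.0, 1.0]
dirty_x = [1.0, 1.0, 4.0]     # only the third feature is corrupted

g_clean = linreg_gradient(clean_x, y, theta)   # residual is 1.0
g_dirty = linreg_gradient(dirty_x, y, theta)   # residual is -2.0

# Every gradient component differs, not just the third one.
changed = [gc != gd for gc, gd in zip(g_clean, g_dirty)]
print(g_clean, g_dirty, changed)
```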
To address this problem, the gradient can be approximated in a way that the effects of dirty features on the gradient are decoupled. Recall, in the a priori detection problem, that associated with each $r \in R_{dirty}$ is a set of errors $f_r, l_r$, which identifies the corrupted features and labels. This property can be used to construct a coarse estimate of the clean value. The main idea is to calculate average changes for each feature and then, given an uncleaned (but dirty) record, add these average changes to correct the gradient.

To formalize the intuition, instead of computing the actual gradient with respect to the true clean values, compute the conditional expectation given that a set of features and labels $f_r, l_r$ are corrupted:

$p_i \propto \|E(\nabla\phi(x_i^{(c)}, y_i^{(c)}, \theta^{(t)}) \mid f_r, l_r)\|$

Corrupted features and labels are defined such that:

$i \notin f_r \implies x^{(c)}[i] - x^{(d)}[i] = 0$
$i \notin l_r \implies y^{(c)}[i] - y^{(d)}[i] = 0$

The needed approximation represents a linearization of the errors, and the resulting approximation will be of the form:

$p(r) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{rx} + M_y \cdot \Delta_{ry}\|$

where $M_x, M_y$ are matrices and $\Delta_{rx}$ and $\Delta_{ry}$ are vectors with one component for each feature and label, where each value is the average change for those features that are corrupted and 0 otherwise. Essentially, it is the gradient with respect to the dirty data plus a linear correction factor. In the appendix, we present a derivation using a Taylor series expansion and a number of $M_x$ and $M_y$ matrices for common convex losses (Appendix E and F). The appendix also describes how to maintain $\Delta_{rx}$ and $\Delta_{ry}$ as cleaning progresses.

Estimation For Adaptive Case. A similar procedure holds in the adaptive setting; however, it requires reformulation. Here, ActiveClean uses the u corruption classes provided by the detector. Instead of conditioning on the features that are corrupted, the estimator conditions on the classes. So for each error class, it computes a $\Delta_{ux}$ and $\Delta_{uy}$. These are the average change in the features given that class and the average change in labels given that class.
$p(r_u) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{ux} + M_y \cdot \Delta_{uy}\|$

Example

Here is an example of using the optimization to select a sample of data for cleaning.

EXAMPLE 5. Consider using ActiveClean with an a priori detector. Let us assume that there are no errors in the labels and only errors in the features. Then, each training example will have a set of corrupted features (e.g., {1, 2, 6}, {1, 2, 15}). Suppose that the cleaner has just cleaned the records $r_1$ and $r_2$, represented as tuples with their corrupted feature set: ($r_1$, {1, 2, 3}), ($r_2$, {1, 2, 6}). For each feature i, ActiveClean maintains the average change between dirty and clean in a vector $\Delta_x[i]$ for those records corrupted on that feature. Then, given a new record ($r_3$, {1, 2, 3, 6}), $\Delta_{r_3 x}$ is the vector $\Delta_x$ where component i is set to 0 if the feature is not corrupted. Suppose the data analyst is using an SVM; then the $M_x$ matrix is as follows:

$M_x[i, i] = \begin{cases} -y[i] & \text{if } y\, x \cdot \theta < 1 \\ 0 & \text{if } y\, x \cdot \theta \geq 1 \end{cases}$

Thus, we calculate a sampling weight for record $r_3$:

$p(r_3) \propto \|\nabla\phi(x, y, \theta^{(t)}) + M_x \cdot \Delta_{r_3 x}\|$

To turn the result into a probability distribution, ActiveClean normalizes over all dirty records.

8. EXPERIMENTS

First, the experiments evaluate how various types of corrupted data benefit from data cleaning. Next, the experiments explore different prioritization and model update schemes for progressive data cleaning. Finally, ActiveClean is evaluated end-to-end in a number of real-world data cleaning scenarios.

8.1 Experimental Setup and Notation

The main metric for evaluation is a relative measure of the trained model and the model if all of the data is cleaned.

Relative Model Error. Let $\theta$ be the model trained on the dirty data, and let $\theta^*$ be the model trained on the same data if it was cleaned. Then the model error is defined as $\frac{\|\theta - \theta^*\|}{\|\theta^*\|}$.

Scenarios

Income Classification (Adult): In this dataset of 45,552 records, the task is to predict the income bracket (binary) from 12 numerical and categorical covariates with an SVM classifier.

Seizure Classification (EEG): In this dataset, the task is to predict the onset of a seizure (binary) from 15 numerical covariates with a thresholded Linear Regression. There are 1498 data points in this dataset.
This classification task is inherently hard, with an accuracy on completely clean data of only 65%.

Handwriting Recognition (MNIST): In this dataset, the task is to classify 60,000 images of handwritten digits into 10 categories with a one-to-all multiclass SVM classifier. The unique part of this dataset is that the featurized data consists of a 784 dimensional vector which includes edge detectors and raw image patches.

Dollars For Docs: The dataset has 24,89 records with 5 textual attributes and one numerical attribute. The dataset is featurized with a bag-of-words featurization model for the textual attributes, which resulted in a 221 dimensional feature vector, and a binary SVM is used to classify the status of the medical donations.

World Bank: The dataset has 193 records of country name, population, and various macro-economics statistics. The values are listed with the date at which they were acquired. This allowed us to determine that records from smaller and less populous countries were more likely to be out-of-date.

Compared Algorithms

Here are the alternative methodologies evaluated in the experiments:

Robust Logistic Regression [14]. Feng et al. proposed a variant of logistic regression that is robust to outliers. We chose this algorithm because it is a robust extension of the convex regularized loss model, leading to a better apples-to-apples comparison between the techniques. (See details in Appendix H.1.)

Discarding Dirty Data. As a baseline, dirty data are discarded.

SampleClean (SC) [33]. SampleClean takes a sample of data, applies data cleaning, and then trains a model to completion on the sample.

Active Learning (AL) [18]. To fairly evaluate Active Learning, we first apply our gradient update to ensure correctness. Within each iteration, examples are prioritized by distance to the decision boundary (called Uncertainty Sampling in [31]). However, we do not include our optimizations such as detection and estimation.
ActiveClean Oracle (AC+O): In ActiveClean Oracle, instead of an estimation and detection step, the true clean value is used to evaluate the theoretical ideal performance of ActiveClean.

8.2 Does Data Cleaning Matter?

The first experiment evaluates the benefits of data cleaning on two of the example datasets (EEG and Adult). Our goal is to understand which types of data corruption are amenable to data cleaning and which are better suited for robust statistical techniques. The experiment compares four schemes: (1) full data cleaning, (2) a baseline of no cleaning, (3) discarding the dirty data, and (4) robust logistic regression. We corrupted 5% of the training examples in each dataset in two different ways:

Random Corruption: Simulated high-magnitude random outliers. 5% of the examples are selected at random and a random feature is replaced with 3 times the highest feature value.

Systematic Corruption: Simulated innocuous looking (but still incorrect) systematic corruption. The model is trained on the clean data, and the three most important features (highest weighted) are identified. The examples are sorted by each of these features and the top examples are corrupted with the mean value for that feature (5% corruption in all). It is important to note that examples can have multiple corrupted features.

Figure 4 shows the test accuracy for models trained on both types of data with the different techniques. The robust method performs well on the random high-magnitude outliers, with only a 2.% reduction in clean test accuracy for EEG and a 2.5% reduction for Adult. In the random setting, discarding dirty data also performs relatively well. However, the robust method falters on the systematic corruption, with a 9.1% reduction in clean test accuracy for EEG and a 1.5% reduction for Adult. The problem is that without cleaning, there is no way to know whether the corruption is random or systematic and when to trust a robust method. While data cleaning requires more effort, it provides benefits in both settings. In the remaining experiments, unless otherwise noted, the experiments use systematic corruption.

[Figure 4. Panels: (a) Randomly Corrupted Data, (b) Systematically Corrupted Data. Y-axis: Test Accuracy. Datasets: EEG, Adult.]
Figure 4: (a) Robust techniques and discarding data work when corrupted data are random and look atypical. (b) Data cleaning can provide reliable performance in both the systematically corrupted setting and randomly corrupted setting.

Summary: A 5% systematic corruption can introduce a 1% reduction in test accuracy even when using a robust method.

8.3 ActiveClean: A Priori Detection

The next set of experiments evaluates different approaches to cleaning a sample of data compared to ActiveClean using a priori detection. A priori detection assumes that all of the corrupted records are known in advance but their clean values are unknown.

Active Learning and SampleClean

The next experiment evaluates the samples-to-error tradeoff between four alternative algorithms: ActiveClean (AC), SampleClean, Active Learning, and ActiveClean+Oracle (AC+O). Figure 5 shows the model error and test accuracy as a function of the number of cleaned records. In terms of model error, ActiveClean gives its largest benefits for small sample sizes. For 5 cleaned records of the Adult dataset, ActiveClean has 6.1x less error than SampleClean and 2.1x less error than Active Learning. For 5 cleaned records of the EEG dataset, ActiveClean has 9.6x less error than SampleClean and 2.4x less error than Active Learning.
Both Active Learning and ActiveClean benefit from the initialization with the dirty model, as they do not retrain their models from scratch, and ActiveClean improves on this performance with detection and error estimation. Active Learning has no notion of dirty and clean data, and therefore prioritizes with respect to the dirty data. These gains in model error also correlate well with improvements in test error (defined as the test accuracy difference with respect to cleaning all data). The test error converges more quickly than model error, emphasizing the benefits of progressive data cleaning, since it is not necessary to clean all the data to get a model with essentially the same performance as the clean model. For example, to achieve a test error of 1% on the Adult dataset, ActiveClean cleans 5 fewer records than Active Learning.

Summary: ActiveClean with a priori detection returns results that are more than 6x more accurate than SampleClean and 2x more accurate than Active Learning for cleaning 5 records.

[Figure 5. Panels: (a) Adult, (b) EEG. X-axis: # Records Cleaned. Y-axes: Model Error %, Test Error %.]
Figure 5: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to Active Learning and SampleClean.

Figure 6: -D denotes no detection, and -D-I denotes no detection and no importance sampling. Both optimizations significantly help ActiveClean outperform SampleClean and Active Learning.

Source of Improvements

The next experiment compares the performance of ActiveClean with and without various optimizations at the point where 5 records have been cleaned. ActiveClean without detection is denoted as (AC-D) (that is, at each iteration we sample from the entire dirty data), and ActiveClean without detection and importance sampling is denoted as (AC-D-I). Figure 6 plots the relative error of the alternatives and ActiveClean with and without the optimizations. Without detection (AC-D), ActiveClean is still more accurate than Active Learning.
With the importance sampling also removed, ActiveClean is slightly worse than Active Learning on the Adult dataset but is comparable on the EEG dataset.

Summary: Both a priori detection and non-uniform sampling significantly contribute to the gains over Active Learning.

Mixing Dirty and Clean Data

Training a model on mixed data is an unreliable methodology lacking the same guarantees as Active Learning or SampleClean, even in the simplest of cases. For thoroughness, the next experiments include the model error as a function of records cleaned in comparison to ActiveClean. Figure 7 plots the same curves as the previous experiment, comparing ActiveClean, Active Learning, and two mixed data algorithms. PC randomly samples data, cleans, and writes back the cleaned data. PC+D randomly samples data using the dirty data detector, cleans, and writes back the cleaned data. For these errors PC and PC+D give reasonable results
