Evaluating Alignment Methods in Dynamic Microsimulation Models
Jinjing Li (Maastricht University) and Cathal O'Donoghue (Teagasc)


Evaluating Alignment Methods in Dynamic Microsimulation Models [1]

Jinjing Li [2], Maastricht University
Cathal O'Donoghue [3], Rural Economy and Development Programme, Teagasc

Abstract: Alignment is a widely adopted technique in the field of microsimulation for social and economic policy research. However, limited research has been devoted to understanding its simulation properties. This paper discusses and evaluates six common alignment algorithms used in dynamic microsimulation against a set of theoretical and statistical criteria proposed in the earlier literature (e.g. Morrison 2006; O'Donoghue 2010). It presents and compares the alignment processes, probability transformations, and the statistical properties of alignment outputs in transparent and controlled setups with both synthetic and real-life datasets (LII). The results suggest that there is no single best method for all simulation scenarios. Instead, the choice of alignment method may need to be adapted to the assumptions and requirements of a specific project.

Key words: alignment, microsimulation, algorithm evaluation

[1] The authors are grateful to Rick Morrison, Howard Redway and Steven Caldwell for helpful discussions over time in relation to alignment in microsimulation models. We are grateful to the Luxembourg AFR for supporting this research.
[2] Email address: Jinjing.Li@maastrichtuniversity.nl
[3] Email address: Cathal.odonoghue@teagasc.ie

I. INTRODUCTION

Microsimulation is a technique used to model complex real-life events by simulating the actions of, and the impact of policy change on, the individual micro unit (Harding, 2007). Microsimulation models are usually categorised as static or dynamic. Static models, e.g. EUROMOD (Mantovani et al., 2007), are often arithmetic models that evaluate the immediate distributional impact upon individuals/households of possible policy changes. Dynamic models, e.g. DESTINIE, PENSIM, SESIM (Bardaji et al., 2003; Curry, 1996; Flood, 2007), extend the static model by allowing individuals to change their characteristics as a result of endogenous factors within the model (O'Donoghue, 2001). Using this method, it is possible to generate new simulated populations that can be used for policy and scenario analysis.

Dynamic microsimulation models typically simulate behavioural processes such as demographic (e.g. marriage), labour market (e.g. unemployment) and income characteristics (e.g. wage). The method uses statistical estimates of these systems of equations and then applies Monte Carlo simulation techniques to generate the new populations, typically over time, both into the future and, when creating histories with partial data, into the past. As statistical models are typically estimated on historical datasets with specific characteristics and period effects, projections of the future may contain error or may not correspond to exogenous expectations of future events. In addition, the complexity of micro behaviour means that simulation models may over- or under-predict the occurrence of a certain event, even in a well-specified model (Duncan and Weeks, 1998). Because of these issues, methods of calibration known as alignment have been developed within the microsimulation literature to correct for the inadequacy of micro projections. Scott (2001) defines alignment as "a process of constraining model output to conform more closely to externally derived macro-data ('targets')".
There are arguments both for and against alignment procedures (Baekgaard, 2002). Concerns about alignment mainly focus on the consistency of the estimates and the level of disaggregation at which alignment should occur. It has been suggested that equations should be reformulated rather than constrained ex post. Clearly, in an ideal world, one would estimate a system of equations that could replicate reality and produce effective future projections without the need for alignment. However, as Winder (2000) stated, microsimulation models usually fail to simulate known time-series data. By aligning the model, goodness of fit to an observed time series can be guaranteed. Some modellers suggest that alignment is an effective pragmatic solution for highly complex models (O'Donoghue, 2010). Over the past decade, aligning the output of a microsimulation model to exogenous assumptions has become standard practice despite this controversy. To meet the need for alignment, various methods, e.g. multiplicative scaling, sidewalk, and sorting-based algorithms, have been experimented with alongside the development of microsimulation (see Morrison, 2006). Microsimulation models using historical datasets, e.g. CORSIM, align their output to historical data to create a more credible profile (SOA, 2001). Models that work prospectively, e.g. APPSIM, also use the technique to align their simulations with external projections (Kelly and Percival, 2009).

Nonetheless, understanding of the simulation properties of alignment in microsimulation models is very limited. Literature on this topic is scarce, with a few exceptions such as Anderson (1990), Caldwell et al. (1998), Neufeld (2000), Chénard (2000a, 2000b), Johnson (2001), Baekgaard (2002), Morrison (2006), Kelly and Percival (2009) and O'Donoghue (2010). Although some new alignment methods were developed in an attempt to address theoretical and empirical deficiencies of earlier methods, discussions of the empirical simulation properties of different alignment algorithms are almost non-existent. This paper aims to fill this gap and to better understand the simulation properties of alignment algorithms in microsimulation. It evaluates all major binary alignment methods using a simple microsimulation model with a set of synthetic datasets and a real-life dataset. It compares the alignment processes, probability transformations, and the statistical properties of alignment outputs in transparent and controlled setups. In addition, a real-life panel dataset, Living in Ireland (LII), is used together with a simplified microsimulation model to evaluate alignment performance in a typical microsimulation project setup. Alignment performance is tested using various evaluation criteria, including those outlined in Morrison (2006).

The present paper is divided into six sections. The next section reviews the background to the alignment methodology used in microsimulation and summarises the existing algorithms used in various models. Section 3 discusses the objectives of alignment and the method of algorithm evaluation. Section 4 describes the details of the datasets used in the evaluation process and some key statistics. We present the results of the evaluation in section 5, and conclude in the last section.

II. ALIGNMENT IN MICROSIMULATION

This section discusses the purpose of alignment in a microsimulation model and the common practice of its statistical implementation.
Baekgaard (2000) suggests two broad categories of alignment: parameter alignment, whereby the distribution function is changed by adjusting its parameters; and ex post alignment, whereby alignment is performed on the basis of unadjusted predictions or interim output from a simulation. This paper focuses primarily on ex post alignment methods, as they are the most common form of alignment in microsimulation.

Models of continuous events, such as the level of earnings or investment income, use statistical regressions with continuous dependent variables and produce a distribution of continuous values. The prediction of the statistical model may, however, deviate from expectations, for example because of an anticipated change in the distribution or in productivity, or it may need to be adjusted for scenario analysis. This raises the need for alignment, which is often a multiplicative adjustment applied to the continuous variables or an adjustment of the error distribution (Chénard, 2000a). For binary variables, however, the same method cannot be applied: binary variable simulation uses discrete choice models such as logit, probit or multinomial logit models, and their outputs cannot be adjusted in the same way as continuous variables. As the majority of processes in dynamic microsimulation models, e.g. in-work status, employment, health and retirement, are binary choices in nature, this paper focuses its attention on the alignment of binary choice models.

Models of discrete events such as in-work status, employment status and disability status typically produce probabilities of the event occurring as output. These models can be expressed in the following generic form:

f(p_i) = X_i β + ε_i    (1)

Equation 1 can be divided into a deterministic component X_i β and a stochastic component ε_i. In a simple Monte Carlo simulation, we generate a random number ε_i^*, adjust the model for endogenous changes in the explanatory variables to produce a new deterministic component X_i^* β, and simulate a new dependent variable. In the case of a binary choice we produce [4]:

f(p_i^*) = X_i^* β + ε_i^*    (2)

The dependent variable is predicted to have a value of 1 if f(p_i^*) > 0 and 0 otherwise [5]. In most cases, a microsimulation model applies this prediction process to all observations individually, without interaction. However, this may lead to a potential side effect: the output of the prediction, although it may look reasonable at the individual level, may not meet the modeller's expectations at the aggregate level. For instance, the simulated average earnings might be higher or lower than assumed, or the in-work rate may be beyond expectations. Therefore, alignment is introduced as a step after the initial prediction in order to correct this error.

[4] Note that f(p_i^*) in the case of a logit model is defined as f(p_i^*) = ln(p_i^* / (1 - p_i^*)).
[5] A more detailed description of logit-based discrete models in microsimulation can be found in O'Donoghue (2010).

Although the theoretical debate over alignment is not settled, alignment is de facto widely adopted in the models built or updated within the last decade, e.g. DYNACAN (Neufeld, 2000), CORSIM (SOA, 1997) and APPSIM (Bacon, 2009). Many papers, e.g. Baekgaard (2002), Bacon (2009) and O'Donoghue (2010), have discussed the main reasons for alignment, which can be summarised as follows:

- Alignment may be used to repair the unfortunate consequences of insufficient estimation data by incorporating additional information into the simulations. Since no country has an ideal dataset for estimating all the parameters needed for microsimulation, modellers often make compromises, which adversely affect output quality. Alignment can be used to fix some of these errors.
- Alignment can be used to adjust for poor predictive performance of the micro model or for its misspecification. Even with perfect data, relationships between dependent and explanatory variables may change considerably in countries where substantial structural changes are taking place. Alignment allows one to correct for these issues and make the simulation consistent with holistic projection assumptions.
- Alignment provides an opportunity for producing scenarios based on different assumptions. Examples include the simulation of alternative recession scenarios on employment, with different impacts on different social groups (e.g. by sex, education or occupation).
- Alignment is instrumental in establishing links between microsimulation models of the household sector and macro models. It is a crucial step towards a consistent micro-macro simulation model (see Davies 2004).
- Alignment can be used to reduce Monte Carlo variability through its deterministic calculation (Neufeld, 2000). This is particularly useful for confining the variability of aggregate statistics in small samples.

Alignment Methods

In order to calibrate the simulation of a binary variable, we need a method that can adjust the outcome of a logit or probit model to produce outcomes that are consistent with the external total. At the moment, there is no standardised method for implementing alignment in microsimulation. Given that different modellers may have different views or needs, it is not surprising that various binary alignment methods have appeared. Papers by Neufeld (2000), Morrison (2006) and O'Donoghue (2010) describe some popular options for alignment used in the literature. Existing documented alignment methods include:

- Multiplicative scaling
- Sidewalk shuffle, sidewalk hybrid and their derivatives
- The Central Limit Theorem approach
- Alignment by sorting (with different sorting variables)

Multiplicative scaling, described in Neufeld (2000), involves undertaking an unaligned simulation using Monte Carlo techniques and then comparing the proportion of transitions with the external control total. The ratio between the desired transition rate and the actual transition rate is calculated and applied in a second pass to the simulated probabilities. The method, however, is criticised by Morrison (2006) because the adjusted probabilities are not limited to the range 0-1, although the problem is rare in practice as the multiplicative ratio tends to be small.
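To make the two-pass procedure concrete, here is a minimal sketch in Python; the function name, the fixed seed, and the clipping of scaled probabilities at 1 are illustrative choices of ours, not part of any published model:

```python
import random

def multiplicative_scaling(probs, target_rate, seed=0):
    """Two-pass multiplicative scaling: run an unaligned Monte Carlo
    simulation, compare the simulated transition rate with the external
    target, then rescale every probability by the ratio and resimulate."""
    rng = random.Random(seed)
    n = len(probs)
    # First pass: unaligned simulation.
    first = [rng.random() < p for p in probs]
    actual_rate = sum(first) / n
    ratio = target_rate / actual_rate if actual_rate > 0 else 1.0
    # Second pass: rescaled probabilities. Note that p * ratio can exceed 1,
    # the weakness noted by Morrison (2006); clipping is one ad hoc remedy.
    return [rng.random() < min(p * ratio, 1.0) for p in probs]
```

With, say, 10,000 probabilities of 0.5 and a target rate of 0.25, the realised second-pass rate lands close to the target, but only in expectation: unlike the sorting methods below, the count of transitions is not hit exactly.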
Neufeld (2000) suggests that solutions to this may include using a nonlinear adjustment.

The sidewalk method was first introduced in Neufeld (2000) as a variance reduction technique, which was also used as an alternative to pure Monte Carlo simulation. It reduces the possibility of unlikely simulated outcomes arising from the use of random numbers. The original method, however, does not align the simulated data to an external control. It simply accumulates a running total of predicted probabilities; each time the accumulated total exceeds 1, a transition occurs. It thereby eliminates the use of random numbers, acting as a variance reduction technique. Nevertheless, the method has difficulty replicating output when the order of the observations changes, as may happen through the deletion of an observation (e.g. a death) or other changes. Serial correlation within families (or other clustering units) is also an issue, as people within a cluster are simulated in order: it is therefore unlikely for two people within a family to be simulated to make a transition in the same year if the transition probabilities are low.

Neufeld (2000) further developed an alignment method that he characterised as a hybrid of independent Monte Carlo simulation and the sidewalk method. DYNACAN adopted this method, applying a non-linear adjustment to the equation-generated probabilities "combined with a minor tweaking of the resulting probabilities depending on whether the simulated rate is ahead of or behind the target rate for the pool during the progress", together with some randomisation (Morrison, 2006). The method calibrates the probabilities through the logit transformation, instead of using the probabilities directly, in order to ensure the values remain bounded between 0 and 1 (SOA, 1998). The sidewalk hybrid method requires two key parameters, which decide how similar the output is to the standard Monte Carlo or the standard sidewalk method.

The Central Limit Theorem approach is described in Morrison (2006). It relies on the assumption that the mean simulated probability is close to the expected mean when N is large, and it manipulates the probabilities of each individual observation on the fly so that the simulated mean matches the expectation. A more detailed description of the method can be found in Morrison (2006). Like all the methods discussed so far, it does not need any sorting routine.

Alignment by sorting was first documented by O'Donoghue (2001) and Johnson (2001). It involves sorting the predicted probability adjusted with a stochastic component, and selecting the desired number of events according to the sorting order. It is seen as a more transparent method (O'Donoghue, 2010), although it is computationally more intensive due to the sorting procedure. Many variations of the method have been used over the years; this paper discusses the three most used algorithms: sort on predicted probability (SOP), sort on the difference between predicted probability and random number (SOD), and sort on the difference between logistically adjusted predicted probability and random number (SODL).
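The three sorting variants share one skeleton, rank the pool on a sort key and select the top n, and differ only in the key itself. A sketch, assuming each observation's logit index X_i β is supplied in a list `xb` (all names are illustrative):

```python
import math
import random

def align_by_sorting(xb, n_events, key="SOD", seed=0):
    """Rank an alignment pool on a sort key and flag the top n_events
    observations as transitions. xb holds the logit index X*beta."""
    rng = random.Random(seed)
    p = [1 / (1 + math.exp(-v)) for v in xb]    # predicted probabilities
    if key == "SOP":
        r = p                                    # probability itself
    elif key == "SOD":
        r = [pi - rng.random() for pi in p]      # probability minus uniform draw
    elif key == "SODL":
        def logistic_draw():
            u = rng.random()
            while u == 0.0:                      # avoid log(0)
                u = rng.random()
            return math.log(u / (1 - u))         # standard logistic noise
        r = [1 / (1 + math.exp(-(v + logistic_draw()))) for v in xb]
    else:
        raise ValueError("unknown sort key: " + key)
    order = sorted(range(len(xb)), key=lambda i: r[i], reverse=True)
    chosen = set(order[:n_events])
    return [i in chosen for i in range(len(xb))]
```

Whatever the key, exactly n_events transitions are produced, which is the property that lets the sorting family hit its external control by construction; the keys differ only in how much stochastic variability they inject into who makes the transition.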
Sort on predicted probability (SOP)

Assume the predicted probability from a logit model is defined as:

p_i^* = exp(X_i β) / (1 + exp(X_i β))    (3)

where p_i^* is the predicted probability and β is the vector of estimated coefficients. This method essentially picks the observations with the highest p_i^* in each alignment pool. One consequence, however, is that those with the highest risk are always selected for transition. In the in-work example, the higher educated, all other things being equal, would always be selected to have a job. In reality, those with the highest risk will on average be selected more often than those with lower risk, but not always. As a result, some variability needs to be introduced. Kelly and Percival (2009) propose a variant of this method, where a proportion of the desired number of events (typically 10%) is selected with the sorting order inverted, so as to allow low-risk units to make a transition.

Sort on the difference between predicted probability and random number (SOD)

Given the shortcomings of simple probability sorting, Baekgaard (2002) uses another method, which sorts by the difference between the predicted probability and a random number. Instead of sorting the probability p_i^* directly, it sorts r_i, which equals the difference between p_i^* and a random number u_i uniformly distributed between 0 and 1. Mathematically, the sorting variable can be defined as follows:

r_i = logit^{-1}(X_i β) - u_i = exp(X_i β) / (1 + exp(X_i β)) - u_i    (4)

A concern about this method is that the range of possible sorting values is not the same for each observation. Because the random number u_i ∈ [0, 1] is subtracted from the deterministically predicted p_i^*, the sorting value overall takes the range r_i ∈ [-1, 1], but for each individual r_i can only take values in [p_i^* - 1, p_i^*]. As a result, when p_i^* is small, say 0.1, the range of possible sorting values is [-0.9, 0.1]. At the other extreme, if p_i^* is large, say 0.9, the range of possible sorting values is [-0.1, 0.9]. Because there is only a small overlap between these extremes, an individual with a small p_i^* has a very low chance of being selected even if a low-value random number is paired with the observation. Ideally, the range of possible sorting values should be the same for each individual, r_i ∈ [a, b], with individuals with a low p_i^* clustered towards the bottom and those with a high p_i^* clustered towards the top.

Sort on the difference between logistically adjusted predicted probability and random number (SODL)

An alternative method, described in O'Donoghue et al. (2008) and Morrison (2006), tries to mitigate the above problem by using a logistic transformation. This method takes the predicted logistic index from a logit model, logit(p_i) = X_i β, and combines it with a random number ε_i drawn from a logistic distribution to produce a randomised variable:

p̃_i = logit^{-1}(X_i β + ε_i)    (5)

p̃_i is then used to sort the individuals, and similarly the top n_j households are selected. The sorting variable can therefore be described as follows:

r_i = logit^{-1}(X_i β + ε_i) = exp(X_i β + ε_i) / (1 + exp(X_i β + ε_i))    (6)

ε_i is a logistically distributed random number with mean 0 and a standard error of π/√3. Since this random number is not uniformly distributed like u_i in the previous method, it produces a different sorting order.

III. METHODS OF EVALUATING ALIGNMENT ALGORITHMS

In order to evaluate the simulation properties of the alignment algorithms, it is important to define what we need to compare and what the criteria are. Although different alignment methods have been briefly documented in a few papers, there is little discussion of the actual performance differences among these methods. Implementations vary from model to model, but no paper so far has validated the alignment methods. This paper evaluates the different algorithms and compares how they perform under different scenarios.

Objectives of Alignment

The objectives of alignment, discussed in Morrison (2006) and O'Donoghue (2010), serve as the basis of our evaluation criteria. From a practical point of view, a good alignment algorithm should be able to:

a) Replicate as closely as possible the external control totals. This is one of the main reasons why alignment is implemented in microsimulation and the common goal of all alignment methods, as discussed in virtually all alignment papers, e.g. Neufeld (2000) and Morrison (2006).

b) Retain the relationship between the dependent and explanatory variables in the deterministic component of the model (O'Donoghue, 2010). In achieving the external totals, the alignment process should not bias the underlying relationship between the dependent and explanatory variables.

c) Retain the shape of the distributions in different subgroups, and their inter-relations, unless there is a reason not to. Morrison (2006) suggests that alignment is about implementing the right numbers of events in the right proportions for a pool's prospective events, as opposed to simply getting the right expected numbers of events. Although alignment processes focus on the aggregate output, they should not significantly distort the relative distribution across sub-groups. For instance, if we align the number of people in work, we not only want to get the numbers right at the aggregate level, but also at the micro/meso level: e.g. the labour participation rate of 30-year-olds should be higher than that of 80-year-olds. This relative distribution should not be changed, at least not substantially, by the alignment method. A highly distorted alignment process would adversely affect distributional analysis, a typical use of microsimulation models.

d) Compute efficiently. Today's computing resources are more abundant than ever; however, when handling large datasets, e.g. full-population datasets, computational constraints remain an important issue. Some projects, e.g. LIAM2/MiDaL (Liégeois, 2010), redesign the entire framework in order to achieve faster speed and accommodate larger datasets.

Indicators of alignment performance

In order to assess alignment algorithms with very different designs, this paper uses a set of quantitative indicators that measure the simulation properties according to the criteria discussed earlier. The indicators include:

- A general fit measure: a false positive rate Pr(Ŷ = 1 | Y = 0) and a false negative rate Pr(Ŷ = 0 | Y = 1), which reflect how well the prediction fits the actual data in general.

- A target deviation index (TDI), which measures the difference between the external control and the simulation outcome. This indicator is directly linked to the first criterion.
- A distribution deviation index (DDI), which measures the distortion of the relationships between different variables, and their inter-relations, as discussed in criteria two and three.
- A computational efficiency measurement: the number of seconds it takes to execute one round of alignment, as outlined in criterion four.

Target Deviation Index (TDI)

Assume that among N observations the ideal number of events is T, and the actual simulated number of events after alignment is S. The Target Deviation Index (TDI) is defined as:

TDI = |T - S| / N    (7)

It is a percentage ranging from 0 to 1, and shows how well the alignment replicates the external control; higher values imply that the outcome is further away from the external control. It is a straightforward indicator for evaluating the first criterion.

Distribution Deviation Index (DDI)

In order to evaluate the second and third criteria, it is necessary to find an indicator that reflects how well the relationships are preserved and how different the new distribution is from the old one. A first approach could be to compare the original coefficients with coefficients re-estimated from the aligned data. Statistically identical coefficients would indicate that the relationship remains the same, at least mathematically. However, this cannot be applied to alignment tests, as alignment itself, by definition, distorts the original probabilities. The coefficients are therefore bound to change even under an optimal alignment, and in most cases the correctly aligned coefficients are not available. A second approach is to check whether the distributions of key variables have changed after alignment, e.g. whether the proportions of male and female workers have changed substantially. A chi-square test could be useful here, as it is frequently used to test whether an observed distribution follows a theoretical distribution. It is defined as:

χ² = Σ_{i=1}^{n} (O_i - E_i)² / E_i    (8)

Nevertheless, the test is not designed for binary values and requires that "no more than 20% of the expected counts are less than 5 and all individual expected counts are 1 or greater" (Yates, Moore & McCabe, 1999). This requirement may not always be fulfilled in microsimulation, depending on the scenario assumptions and the way groups are defined. As a result, an adaptation is required in order to best measure the deviation between two distributions for binary variables with possibly low or zero expected counts.
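The TDI defined above is a one-line computation; a minimal sketch (the function and argument names are ours, not from the paper's code):

```python
def target_deviation_index(target_events, simulated_events, n_obs):
    """TDI = |T - S| / N: the share of the pool by which the aligned
    simulation misses the external control total (equation 7)."""
    return abs(target_events - simulated_events) / n_obs
```

A pool of 1,000 observations with a target of 250 events and 240 simulated events gives a TDI of 0.01; an exact hit gives 0.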

This paper proposes a purpose-built distribution deviation index (DDI) to evaluate the second and third criteria in choosing an alignment method. Assume we want to evaluate the distributional distortion in a single alignment pool via a grouping variable X, where X could be anything, such as age, gender, or an age-gender interaction. The N observations are divided among n(X) cells. S_i is the mean rate of event occurrence after alignment in group i, and O_i is the observed rate in the base dataset. If we define R as the alignment ratio used in the aligning process, O_i R represents the expected rate after alignment. The distribution deviation index (DDI) is then defined as:

DDI = Σ_{i=1}^{n(X)} (N_i / N) (S_i - O_i R)²    (9)

This indicator describes how well the micro-simulated data retain the relationship between the dependent variable and the variable X. It is a minimum distance estimation tailored for binary variable outcomes in a simulation. Essentially, the DDI calculates the sum of squared differences weighted by the number of observations. It measures the differences between distributions before and after alignment in multiple dimensions, depending on the vector X. When X is an independent variable, it measures the distortion introduced by alignment between the independent and the dependent variable. When X is the dependent variable, the DDI reports the degree of nonlinearity in the probability distortion of alignment. When X is a variable outside the equation, the DDI assesses the level of distortion in an implicit relationship. In short, X can be a vector consisting of any variables and interaction terms. The indicator is positively correlated with the alignment deviation: it increases when the aligned distribution departs from the original and decreases as the distributions become more alike. The scale of the indicator is independent of the choice of variable X and the number of groups that X may produce. Since S_i and O_i are both probabilities between 0 and 1, the DDI has a range of 0 to 1. When the dataset preserves the shape of the distribution perfectly, the index has a value of 0.
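Equation (9) can be sketched as a weighted sum over the cells of X; the tuple layout and names below are illustrative:

```python
def distribution_deviation_index(cells, alignment_ratio):
    """DDI = sum_i (N_i / N) * (S_i - O_i * R)^2 over the n(X) cells of
    the grouping variable X, where S_i is the aligned event rate in cell i,
    O_i the observed base-data rate, and R the alignment ratio.
    `cells` is an iterable of (n_i, observed_rate, aligned_rate) tuples."""
    cells = list(cells)
    n_total = sum(n for n, _, _ in cells)
    return sum((n / n_total) * (s - o * alignment_ratio) ** 2
               for n, o, s in cells)
```

When every cell's aligned rate equals O_i R, the index is exactly 0, matching the perfect-preservation case described in the text.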
It increases as the difference between the two distributions grows, with a maximum value of 1.

Computational efficiency

The most intuitive indicator of the computational efficiency of an alignment algorithm is the execution time: the length of time an alignment method takes to execute one round of alignment with input in randomised order. In order to have comparable inputs and outputs, all methods are required to retain the initial order of the input. This makes each algorithm ready to use as a module in a microsimulation model. However, this extra requirement penalises the speed of the methods that require random shuffling, as the observations need to be re-sorted before the end of the execution. The evaluation of computational efficiency is performed in Stata because of its easy integration of estimation and simulation. Given that computer speeds vary considerably, the results presented in this paper may change dramatically on a different platform, although we would expect the relative ranking to remain stable in most cases.

Alignment algorithms evaluated

This paper evaluates all alignment algorithms discussed earlier, which include:

- Multiplicative scaling
- Sidewalk hybrid with nonlinear adjustment
- Central limit theorem approach
- Sort on predicted probability (SOP)
- Sort on the difference between predicted probability and random number (SOD)
- Sort on the difference between logistic adjusted predicted probability and random number (SODL)

When implementing the sidewalk hybrid with nonlinear adjustment, two parameters are required, η and λ. η is the maximum allowed difference between the actual number of events and the expected number of events before λ is added to or subtracted from the predicted probability. In this paper, η is set to 0.5 and λ to 0.03, the same values used in the DYNACAN model (Neufeld, 2000). The order of the initial input is shuffled in order to remove undesired serial correlation.

IV. DATASETS AND SCENARIOS IN ALIGNMENT ALGORITHM EVALUATION

In order to understand the simulation properties of alignment algorithms, this paper evaluates the performance of the various methods under two settings: a lab setting, where a synthetic dataset is used, and a real-world setting, where the algorithms are applied to a real-world dataset. This setup makes it possible to examine the performance of the alignment methods under different scenarios. The paper starts the evaluation by using synthetic datasets in a controlled setting. Alignments are used to correct artificial errors in the outcome of the statistical model. Since it is possible to control the exact source of the error in a synthetic dataset, we can analyse the simulation properties of the different alignment algorithms and the probability transformations in a fully transparent setup. The synthetic-dataset-based evaluation tests the alignment performance of the different methods in four scenarios. Each scenario represents a potential statistical error that alignment methods try to address or compensate for in a microsimulation model.
The quality of the alignment is measured by the target deviation index (TDI) and the distribution deviation index (DDI), where the grouping variable X is the percentile of the correct probabilities. Computational cost is measured by the number of seconds the algorithm takes to execute one run.

Baseline scenario

Assume there is a binary model expressed as follows:

y^* = logit^{-1}(\alpha + \beta x + \epsilon)    (10)

α and β are the parameters of the equation, and ε is an error term which follows a logistic distribution with zero mean and a variance of \pi^2 / 3. To simplify the calculation in the evaluation, we assign α = 0 and β = 1. x is randomly drawn from a standard normal distribution N(0,1). The number of observations in the synthetic dataset is 100,000.
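The baseline dataset described above can be sketched as follows. This is our own illustration rather than the authors' code: the logistic latent error is generated implicitly by comparing a uniform draw with the logit probability, which is equivalent to drawing ε from a standard logistic distribution.

```python
import math
import random

def simulate_baseline(n=100_000, alpha=0.0, beta=1.0, seed=1):
    """Generate the baseline synthetic dataset of eq. (10):
    x ~ N(0,1), p = logit^-1(alpha + beta*x), binary outcome y.
    The true probabilities are kept for later comparison."""
    rng = random.Random(seed)
    xs, ys, probs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        xs.append(x)
        # a uniform draw below p is equivalent to a logistic latent error
        ys.append(1 if rng.random() < p else 0)
        probs.append(p)
    return xs, ys, probs
```

With α = 0, the outcome mean is close to 0.5, matching the baseline column of Table 1.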

Table 1 lists all the key statistics in the baseline scenario, and Figure 1 illustrates the distribution of the baseline probabilities.

First scenario: Sample bias

In the first synthetic test scenario, we try to replicate an error that commonly exists in survey datasets: sample bias. Sample bias exists widely among survey datasets, and it is most commonly corrected by the implementation of observation weights. Unbiased estimation of behavioural equations depends on accurate weights. Nonetheless, despite all efforts, survey datasets may still suffer from various sample biases, particularly selection bias and attrition bias in panel datasets such as the ECHP (Vandecasteele and Debels, 2007). Sample bias leads to a non-representative dataset, which affects the quality of the simulation output. Alignment is sometimes used to compensate for sample bias. In our test, a simple sample bias is recreated: we remove 50% of the observations with a positive response (y^* > 0) at random from the baseline dataset. This produces a non-representative sample with a size equivalent to 75% of the original one. In other words, the observations with a negative response weigh twice as much as they should in the dataset. In addition, the error structure (ε) has a different distribution than in the baseline scenario as a consequence of the bias introduced.

Second scenario: Biased alpha (intercept)

The second synthetic scenario aims to replicate a monotonic shift of the probabilities. This is commonly used in scenario analysis, where a certain ratio, e.g. the unemployment rate, is required to increase or decrease to meet the scenario assumptions. By manipulating the intercept of the equation, it is possible to shift the probabilities across all observations. In this scenario, α is changed to -1 while everything else is held constant. The result is a monotonic but non-uniform change in the probabilities. A non-uniform transformation is required to make sure the probabilities are still bounded within the range [0,1]. Figure 2 demonstrates the transformation graphically.
As seen, the probability transformation curve for the second scenario stays below the 45-degree line and has a varying slope. This indicates that the transformation is monotonic but non-uniform. Contrary to the previous scenario, the error structure and the number of observations stay the same in this setup. Table 1 highlights the statistical differences between this scenario and the other ones.

Third scenario: Biased beta

The third synthetic test scenario introduces a biased slope in the equation. This represents a change in the behavioural pattern which could not be captured at the time of estimation (e.g. the evolution of fertility patterns). In this scenario, one may assume that the behavioural pattern shifts over time. This particular setup tests how alignment works as a correction mechanism for behavioural patterns. The simulated dataset in this scenario is generated with β = 0.5, half of its value in the baseline, and therefore creates a different distribution of probabilities. Since x has a mean value of 0, the change does not affect the total sample mean of y at the aggregate level. The transformation yields a different distribution but an unchanged sample mean. Figure 1 graphically illustrates the difference in the probability distribution. As seen, the standard deviation of the probabilities in scenario 3 is much lower than in the baseline scenario while the mean value remains the same. Unlike the first and second scenarios, the transformation in this scenario causes a non-monotonic change in probabilities. Observations with a low probability (p < 0.5) in the baseline scenario have an increased probability, since their x values are negative, while observations with a high probability (p > 0.5) have a lower probability compared with the baseline scenario.

Fourth scenario: Biased intercept and beta

The last synthetic test scenario combines both the change in intercept and the shift in slope. The new transformed dataset has α = -1 and β = 0.5. This scenario represents a relatively complex change, resulting in a lowered aggregate mean of y and a non-monotonic change in the individual probabilities.

Table 1 Overview of the Synthetic Data Scenarios

Scenario | Baseline | 1 | 2 | 3 | 4
Number of observations in estimation | 100,000 | 75,000 | 100,000 | 100,000 | 100,000
Number of observations in simulation | 100,000 | 100,000 | 100,000 | 100,000 | 100,000
Mean value of outcome variable | 0.500 | 0.330 | 0.303 | 0.500 | 0.277
Alpha | 0.000 | -0.695 (0.008) | -1.000 | 0.000 | -1.000
Beta | 1.000 | 0.998 (0.010) | 1.000 | 0.500 | 0.500
Target ratio for alignment | | 0.5 | 0.5 | 0.5 | 0.5

N.B.: Coefficients in the first scenario are estimated using a logit model. Standard errors are in brackets.

As an overview, Table 1 summarises the changes of alpha and beta in the different scenarios and compares the key statistics. As seen, all scenarios have the same number of observations except the first one. The mean value of the outcome variable ranges from 0.277 to 0.5, and the target for alignment (external value) is 0.5 across all scenarios. Figure 1 gives a visual picture of the probability distributions in the different scenarios. We see that all probability distributions, with the exception of the baseline and third scenarios, exhibit a right-skewed pattern.
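As an illustration, the four error scenarios summarised in Table 1 can be sketched as follows. This is hypothetical code, not the authors' implementation: scenario 1 drops half of the positive responses, while scenarios 2-4 simply re-draw outcomes from logit probabilities with the biased alpha and beta.

```python
import math
import random

def make_scenario(xs, ys, scenario, seed=2):
    """Construct one of the four synthetic error scenarios from the
    baseline data (xs, ys).  Returns (xs, ys) for the scenario."""
    rng = random.Random(seed)
    if scenario == 1:  # sample bias: randomly drop 50% of y = 1 observations
        pairs = [(x, y) for x, y in zip(xs, ys)
                 if not (y == 1 and rng.random() < 0.5)]
        return [p[0] for p in pairs], [p[1] for p in pairs]
    alpha, beta = {2: (-1.0, 1.0),   # biased intercept
                   3: (0.0, 0.5),    # biased slope
                   4: (-1.0, 0.5)}[scenario]
    new_ys = []
    for x in xs:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        new_ys.append(1 if rng.random() < p else 0)
    return xs, new_ys
```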
Figure 2 further compares the difference between the correct probabilities and the transformed probabilities in the above scenarios.

Figure 1 Overview of Probability Distribution in Different Scenarios
[Density of the probabilities (0 to 1) for the baseline scenario and scenarios 1-4.]

Figure 2 Overview of Probability Transformation in Different Scenarios
[Transformed probabilities plotted against baseline probabilities for the baseline scenario and scenarios 1-4.]
N.B.: The probability transformation curve records how probabilities change due to the artificial errors introduced in each scenario.

Evaluation using a real-world dataset

There is no doubt that the synthetic evaluation contributes to the understanding of alignment methods thanks to its complete transparency. An alignment algorithm,

however, is only useful when applied to a real-world dataset. Therefore, this paper also analyses the performance of the different alignment algorithms using a real dataset. In this real-world evaluation, we use the 1994-2001 Living in Ireland Survey (LII) dataset for a simple exercise of labour participation simulation. The LII survey constitutes the Irish component of the European Community Household Panel (ECHP). It is a representative household panel survey conducted on the Irish population annually for eight waves until 2001. The data contain information on the demographic, employment, and other socio-economic characteristics of around 3,500 households in each wave. In 2000, an additional 1,500 households were brought into the dataset to compensate for the attrition since 1994. The dataset has been cleaned and adjusted to ensure consistency, as described in Li and O'Donoghue (2010). Labour participation simulation is selected because it is one of the most popular components in dynamic microsimulation models. The simulation uses a reduced-form equation for labour participation. Assume the in-work status y^* is derived from the following specification:

y^* = logit^{-1}(\beta X + \epsilon)    (11)

where X is a vector that covers lagged in-work status, education, gender, age, age squared, an interaction term between gender and having a new-born, and an interaction term between marriage and gender. In the estimation, we include individuals aged 15-69 with known previous working status. Table 2 provides some basic summary statistics of the variables included, and the estimation results are reported in Appendix I.
Table 2 Overview of variables included in the in-work estimation

Variable | In-work Mean | In-work Std. Dev. | Out-of-work Mean | Out-of-work Std. Dev.
Lagged in-work status | 0.86 | 0.32 | 0.14 | 0.31
Gender (female=1) | 0.38 | 0.49 | 0.62 | 0.49
Age | 37.18 | 13.18 | 37.60 | 17.79
Age squared | 1555.98 | 1053.01 | 1730.53 | 1447.02
Having a new-born | 0.03 | 0.17 | 0.02 | 0.12
Marriage | 0.54 | 0.50 | 0.44 | 0.50
Secondary education | 0.24 | 0.43 | 0.19 | 0.39
University education | 0.31 | 0.46 | 0.15 | 0.35
Interaction term: new-born and gender | 0.01 | 0.10 | 0.01 | 0.11
Interaction term: marriage and gender | 0.18 | 0.39 | 0.33 | 0.47
Number of observations in the category | 31,784 | | 29,448 |
Total number of observations: 61,232

In the previous literature on microsimulation validation, Caldwell and Morrison (2000) suggest using in-sample validation, out-of-sample validation and multiple-module validation to evaluate simulation output. This paper follows a similar approach for algorithm evaluation, except that there is no multi-module evaluation, since alignment is usually an integrated part of a more complex model.

In-sample evaluation assesses the predictive power of the model in describing the data on which it was estimated. In this scenario, we test how well the model replicates the labour participation rate in 1998 with a known external control (the observed number of workers) using the different alignment methods. 1998 is selected because it is in the middle of the period the data cover. Equation coefficients are estimated from the whole panel with the exception of the first wave, where lagged in-work status is not available. Alignment performance indicators are calculated in the same way as in the synthetic dataset evaluation. An in-sample evaluation test is useful, but it differs from a real microsimulation exercise, where values are predicted out of sample. An out-of-sample evaluation attempts to measure the predictive power of the model in explaining data of a similar type which were not used in the estimation of the model (Caldwell, 1996). In this particular test, we use the 1995-1998 data to predict the period 1999-2001 with the known external control (the observed number of workers) and analyse the differences in the alignment methods' performance. The benchmark distribution for DDI is the actual observed distribution in 1999-2001.

V. EVALUATION RESULTS

This section reports the evaluation results of the six alignment algorithms and compares their performance under different scenarios through false positive/negative rates, two self-defined indices (TDI, DDI) and computation time.

Evaluation results using synthetic datasets

Table 3 lists the four key indicators obtained in the synthetic dataset evaluation: the target deviation index (TDI), the false positive rate, the false negative rate and the distribution deviation index (DDI). The DDI in this synthetic test uses the percentile of the dependent variable as the grouping variable X.
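The DDI of equation (9), used throughout the tables that follow, can be computed per grouping variable as sketched below. This is a minimal Python illustration with our own function and variable names, not the Stata code used in the evaluation.

```python
from collections import defaultdict

def ddi(aligned, observed, groups, ratio):
    """Distribution deviation index of eq. (9): sum over groups of
    (N_g / N) * (S_g - O_g * R)^2, where S_g and O_g are the mean
    post-alignment and base-data event rates in group g, and R is
    the alignment ratio."""
    cells = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum aligned, sum observed
    for a, o, g in zip(aligned, observed, groups):
        cell = cells[g]
        cell[0] += 1
        cell[1] += a
        cell[2] += o
    n = len(aligned)
    return sum((c / n) * (s / c - (o / c) * ratio) ** 2
               for c, s, o in cells.values())
```

When the aligned rates match the rescaled base rates in every group, the index is 0; it grows toward 1 as the distributions diverge.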
Table 3 Properties of Different Alignment Methods in the Synthetic Dataset Test

Method | TDI | False Positive | False Negative | DDI

Scenario 1: Selection bias
Multiplicative scaling | -0.43% | 19.33% | 19.76% | 0.40%
Sidewalk hybrid with nonlinear adjustment | 0.00% | 20.63% | 20.63% | 0.03%
Central limit theorem approach | 0.00% | 19.65% | 19.65% | 0.43%
Sort on predicted probability (SOP) | 0.00% | 16.31% | 16.31% | 11.50%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 21.09% | 21.09% | 0.15%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 20.69% | 20.69% | 0.03%

Scenario 2: Biased alpha (intercept)
Multiplicative scaling | -1.41% | 18.74% | 20.15% | 0.61%

Sidewalk hybrid with nonlinear adjustment | 0.00% | 20.69% | 20.69% | 0.03%
Central limit theorem approach | 0.00% | 19.29% | 19.29% | 0.65%
Sort on predicted probability (SOP) | 0.00% | 16.31% | 16.31% | 11.50%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 21.31% | 21.31% | 0.30%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 20.70% | 20.70% | 0.03%

Scenario 3: Biased beta coefficients
Multiplicative scaling | -0.18% | 22.58% | 22.76% | 0.90%
Sidewalk hybrid with nonlinear adjustment | -0.01% | 22.59% | 22.60% | 0.84%
Central limit theorem approach | 0.00% | 22.69% | 22.69% | 0.91%
Sort on predicted probability (SOP) | 0.00% | 16.31% | 16.31% | 11.50%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 22.54% | 22.54% | 0.87%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 22.56% | 22.56% | 0.88%

Scenario 4: Biased alpha and beta (all coefficients)
Multiplicative scaling | 0.18% | 21.57% | 21.39% | 0.26%
Sidewalk hybrid with nonlinear adjustment | 0.00% | 22.45% | 22.44% | 0.85%
Central limit theorem approach | 0.00% | 21.54% | 21.54% | 0.28%
Sort on predicted probability (SOP) | 0.00% | 16.31% | 16.31% | 11.50%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 22.97% | 22.97% | 1.33%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 22.67% | 22.67% | 0.92%

Average performance
Multiplicative scaling | -0.46% | 20.55% | 21.02% | 0.54%
Sidewalk hybrid with nonlinear adjustment | 0.00% | 21.59% | 21.59% | 0.44%
Central limit theorem approach | 0.00% | 20.79% | 20.79% | 0.57%
Sort on predicted probability (SOP) | 0.00% | 16.31% | 16.31% | 11.50%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 21.98% | 21.98% | 0.66%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 21.66% | 21.66% | 0.46%

As seen in Table 3, in all scenarios, all alignment methods except multiplicative scaling deviate by less than 0.01% from the target number of event occurrences, while multiplicative scaling
shows deviations of up to 1.41% from the target during the evaluation. The result is largely driven by the design of the algorithm, as multiplicative scaling cannot guarantee a perfect alignment ratio even though its expected deviation is zero. The sidewalk hybrid sometimes has a slight deviation (less than 0.01%), as the nonlinear transformation may not always be perfect under the existing implementation [6]. Central limit theorem methods have built-in

[6] The process usually requires several iterations and is computationally expensive (Neufeld, 2000). The test model used in this paper stops its calibration when an iteration improves the average probability by no more than 10^-8. This increases the calculation speed but sometimes results in imperfectly aligned probabilities. Details of the calibration steps can be found in the book published by the Society of Actuaries (SOA, 1998).

counters that prevent events from manifesting once the target is met. Sorting-based algorithms pick only the exact number of observations required, which is why their target deviation index (TDI) is always zero. In terms of false positive and false negative rates against the correct values, the SOP method yields the best result, on average 4 to 6 percentage points lower than the other algorithms, as shown in the tables. The sidewalk hybrid, together with SOD and SODL, has the highest false positive/false negative rates on average. It seems that the false positive and false negative rates are closely related to the complexity of the algorithms: the nonlinear transformation in the sidewalk hybrid and the differencing operations in SOD and SODL are both more computationally complicated than the other methods. This pattern is consistent across all scenarios, though the absolute numbers fluctuate between scenarios. Whilst the false positive and false negative rates are useful indicators when the correct value is known, they are less critical for simulation, as microsimulation exercises tend to focus more on distributions. Therefore, the distribution deviation index (DDI) is particularly important in judging how well the relative relations between variables are preserved after alignment. Appendix 2 visualises the difference between actual probabilities and aligned probabilities in all synthetic tests. The results show that the SOP method heavily distorts the original distribution of the probabilities across all scenarios using percentile grouping. This is also reflected by the distribution deviation index (DDI), which in this case effectively calculates a weighted size of the gap. No method consistently outperforms the others across all scenarios. In the first two scenarios, the sidewalk hybrid and SODL methods give the best results; in the third scenario, where the synthetic dataset modifies the slope of x, all methods have similar DDI values except SOP; in the last scenario, multiplicative scaling and the central limit theorem method generally perform much better than the rest.
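As noted, the sort-based methods select exactly the required number of events, which is why their TDI is always zero. A minimal SOD-style sketch (our own illustration, not the evaluated Stata implementation) makes this mechanical guarantee visible:

```python
import random

def sod_align(probs, target, seed=3):
    """Sketch of the SOD idea: rank observations by p_i - u_i with
    u_i ~ U(0,1), then mark exactly `target` top-ranked observations
    as events, so the realised count always equals the target."""
    rng = random.Random(seed)
    ranked = sorted(range(len(probs)),
                    key=lambda i: probs[i] - rng.random(),
                    reverse=True)
    out = [0] * len(probs)
    for i in ranked[:target]:  # top `target` ranks become events
        out[i] = 1
    return out
```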
Compared with the other methods, the methods which involve differencing and logistic transformation (incl. the sidewalk hybrid with nonlinear transformation, SOD and SODL) seem to be more sensitive to changes in the beta coefficient. Their performance is much better when beta remains stable, e.g. in scenarios 1 and 2. This may be due to the nature of these algorithms, as the differencing and logit transformation operations assume monotonic changes in the probabilities.

Evaluation results using a real-world dataset

The synthetic-dataset-based evaluation offers an overview of the performance of the different algorithms under particular sources of noise, but the performance on a real-world dataset is more interesting for empirical modellers. Table 4 reports all the key indicators calculated when applying alignment to a real-life dataset, with the example of estimating the in-work population. DDI is calculated based on independent variables, including sex, education, and marriage status with childbirth interaction, and an external variable, nationality. It reflects an overall shift of the distribution in multiple dimensions.

Table 4 Properties of Different Alignment Methods with a Real-World Dataset (LII)

Method | TDI | False Positive | False Negative | DDI

In-Sample Evaluation

Multiplicative scaling | 0.24% | 10.00% | 9.76% | 0.62%
Sidewalk hybrid with nonlinear adjustment | 0.01% | 9.47% | 9.45% | 0.64%
Central limit theorem approach | 0.00% | 9.57% | 9.57% | 0.62%
Sort on predicted probability (SOP) | 0.00% | 5.86% | 5.86% | 0.62%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 9.64% | 9.64% | 0.62%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 9.60% | 9.60% | 0.67%

Out-of-Sample Evaluation
Multiplicative scaling | 0.10% | 11.24% | 11.14% | 0.75%
Sidewalk hybrid with nonlinear adjustment | 0.00% | 11.04% | 11.04% | 0.68%
Central limit theorem approach | 0.00% | 11.12% | 11.12% | 0.74%
Sort on predicted probability (SOP) | 0.00% | 7.63% | 7.63% | 1.47%
Sort on the difference between predicted probability and random number (SOD) | 0.00% | 11.14% | 11.14% | 0.66%
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.00% | 11.03% | 11.03% | 0.76%

N.B.: In-sample evaluation predicts 1998 in-work status using 1995-2001 data. Out-of-sample evaluation predicts 1999-2001 in-work status using 1995-1998 data.

Similar to the results from the synthetic datasets, multiplicative scaling is the only method with a TDI greater than 0.01%, and the SOP method outperforms all other methods in terms of false positive and false negative rates by a significant margin. All other evaluated methods have similar false positive and negative rates. As to the DDI, there is no dramatic difference between the methods in the in-sample evaluation. We notice that the SOP method has a much more comparable DDI performance on the real-life dataset than on the synthetic ones; in fact, SOP has one of the best results in the in-sample evaluation. In the out-of-sample exercise, we find that SOD, a method with average performance on the synthetic datasets, has the lowest DDI value, while SOP has the worst result. Besides the algorithm design, the change of grouping variables also affects the observed DDI pattern in this evaluation.
With the synthetic datasets, groups are divided based on the percentile value of the dependent variable, while in the real-world dataset, observations are grouped in a realistic setting, using characteristic variables such as age and gender.

Computing performance and scalability

Computational efficiency is another main criterion for evaluating alignment algorithms. Given the increasing availability of large-scale datasets in microsimulation and growing model complexity, alignment may consume considerable resources in the computation process. Nonetheless, studies of computational efficiency are rather scarce in the field of microsimulation, and no paper so far has analysed how the number of observations affects the algorithms' performance. This section compares the different alignment algorithms in terms of computational efficiency and discusses the issue of the scalability of the algorithms. Table 5 shows an overview of the computation time required during the synthetic scenario tests and the real-world data test. The computation is timed on an Intel i5-520M processor using only a single core. As indicated, the method that takes the

least computational resources is the multiplicative scaling method. This is not surprising, as multiplicative scaling involves only a single calculation for each observation. The sorting-based alignment methods are in the next tier, consuming up to 5 times more resources than multiplicative scaling. Variations of the sorting method do not change the execution time much, although the last sorting variation, SODL, consumes around 10% more resources than the other sorting-based algorithms due to its higher computational complexity. The sidewalk hybrid with nonlinear transformation is at the bottom of the list in terms of efficiency. It takes about 80 times more CPU time than the fastest method, multiplicative scaling, and 15-20 times more CPU time than the sorting-based algorithms. There are three reasons for its relatively poor performance. Firstly, the nonlinear transformation may take many iterations and is computationally expensive (Neufeld, 2000). Secondly, the method itself suffers from serial correlation in the original design, as each calculation depends on the result of the previous observation. In order to mitigate this effect, an extra randomisation via sorting is implemented, accompanied by a reverse process which restores the original order of the input at the end of the alignment. Thirdly, the sidewalk method requires iterating through observations one by one. Stata, the platform of our evaluation, is not particularly efficient at per-observation iteration compared with the batch processing for which it is optimised [7]. This is also the primary reason why the central limit theorem approach has a relatively long running time. We speculate, from a theoretical point of view, that the performance of the sidewalk method and the central limit theorem approach could be significantly improved when implemented as native code in C/C++, since compiled code does not re-interpret the syntax over the iterations. Nonetheless, the sidewalk method may still be slower than the other algorithms when the nonlinear probability transformation is applied.
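The single-pass nature of multiplicative scaling noted above can be illustrated with a short sketch (hypothetical Python, not the Stata code used in the evaluation). Each probability is multiplied by one common ratio so that the expected number of events equals the target; the realised count can still deviate, which is consistent with this method's nonzero TDI in Table 3.

```python
import random

def multiplicative_scaling(probs, target, seed=4):
    """Scale each probability by target / sum(probs) and draw outcomes.
    One multiplication and one draw per observation, hence very cheap.
    Scaled probabilities are capped at 1 for safety (our own choice)."""
    ratio = target / sum(probs)
    rng = random.Random(seed)
    return [1 if rng.random() < min(1.0, p * ratio) else 0 for p in probs]
```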
Table 5 Computational Costs for Different Alignment Methods (seconds)

Method | Synthetic 1 | Synthetic 2 | Synthetic 3 | Synthetic 4 | In-Sample | Out-of-Sample
Multiplicative scaling | 0.07 | 0.07 | 0.07 | 0.07 | 0.04 | 0.13
Sidewalk hybrid with nonlinear adjustment | 5.71 | 5.88 | 5.49 | 5.78 | 1.30 | 4.22
Central limit theorem approach | 3.34 | 3.40 | 3.50 | 3.55 | 0.63 | 2.12
Sort on predicted probability (SOP) | 0.32 | 0.33 | 0.33 | 0.35 | 0.17 | 0.58
Sort on the difference between predicted probability and random number (SOD) | 0.34 | 0.34 | 0.34 | 0.34 | 0.18 | 0.61
Sort on the difference between logistic adjusted predicted probability and random number (SODL) | 0.36 | 0.36 | 0.36 | 0.38 | 0.18 | 0.63

When the number of observations, i.e. the size of the input, is increased, all algorithms exhibit a mostly linear growth in execution time in Stata (see Figures 3 to 5) for datasets under 15 million observations. The run-time seems to be directly proportional to the input size. All alignments use the same input dataset, which is a randomly

[7] Observation iteration, a necessary step for these two algorithms, tends to be very slow in Stata because loops are re-interpreted at each iteration. Stata recommends using compiled plug-ins for the best performance in this type of scenario (Stata, 2008). However, algorithm-specific optimisation using compiled code is beyond the scope of this paper and would make the comparison difficult.