SESUG 2011
Paper PO-3

A Coding Practice for Preparing Adaptive Multistage Testing
Yung-chen Hsu, GED Testing Service, LLC, Washington, DC

ABSTRACT

The purpose of this paper is to present a simulation study of a coding practice for preparing adaptive multistage testing (MST) designs for a credentialing testing program in the coming years. MST is an adaptive test administration method in which a test form is tailored as a sequence of pre-constructed modules at the item set level. At each adaptation point a module is selected to match the proficiency estimate of the examinee, based on cumulative performance on previously administered modules. For some testing programs, MST is considered a better fit for future test development because the delivery model offers a balanced tradeoff between computerized adaptive tests and traditional linear fixed-length tests. In the simulation, a macro is developed to estimate the proficiency scores based on item response theory. The algorithm is implemented with PROC IML using the Newton-Raphson method. To assess the classification consistency and decision accuracy for examinees, kappa coefficients from PROC FREQ and additional consistency measures are computed to characterize the extent of the agreement more fully. Practical policy questions and test development considerations are also discussed.

INTRODUCTION

For many credentialing testing programs, using computers to administer exams is a trend in the coming years. There are good practical reasons for adopting computer-based test administration, including automated scoring and fast reporting, flexible exam schedules and locations, and potentially higher efficiency and more precise proficiency estimation of examinees through computer-based or adaptive testing.

Several innovative test delivery models have been considered in practice, such as computerized fixed testing (CFT), item-level computer-adaptive testing (CAT), and multistage testing (MST) (Hendrickson, 2007; Jodoin, Zenisky, and Hambleton, 2006). A CFT is analogous to the fixed-item paper-and-pencil test (PPT) but with more modern varieties that can be administered. For example, different examinees may take different forms of the test or receive the items in different orders. In contrast, CAT adapts the difficulty of the test and presents each new item based on the proficiency estimate from an examinee's performance on previous items. CAT generally uses far fewer test items than CFT does and has the advantage of offering improved efficiency in estimating an examinee's proficiency level. However, potential psychometric issues and practical shortcomings have been found in past CAT practice, such as item exposure control and content balancing. In addition, data management effort and deployment cost are business and financial concerns for administering the test in an operational environment.

MST was proposed to balance the tradeoff between CFT and CAT. MST is a test administration method closely related to CAT, but the test adapts at the item set level instead of the item level. For some examination programs MST is considered a better delivery model for test development. MST may help ameliorate the problems encountered in a traditional CAT yet still offer better testing efficiency than PPT or CFT. Depending on the nature of the test and the practical policy of the testing program, there are practical test development considerations related to the design and implementation. This study is a coding practice that uses simulated data to prepare information for the decision makers of a credentialing testing program.

ADAPTIVE MULTISTAGE TESTS

Figure 1 depicts the generalized procedure of adaptive multistage test design. A test form consists of a series of test modules, and each test taker potentially takes a different set of modules that is best targeted to the individual's ability. MST starts with an initial test module for all examinees. The initial module commonly contains either items of moderate difficulty at the median proficiency of the intended group or items covering a broad range of difficulty values. From the examinee's performance on the initial module, a proficiency score can be estimated. The proficiency estimate is then used to select the next module that matches the examinee's proficiency level. Normally, the accumulated performance is used to estimate the proficiency, and in each round a module with a narrower and more focused difficulty range is selected until the test ends.
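The routing step described above can be sketched in a short DATA step. This is a minimal illustration under stated assumptions, not the paper's implementation: the data set names `stage1` and `routed`, the variable `theta1`, the module labels, and the cut points of -1 and 1 are all hypothetical.

```sas
/* Hypothetical stage-1 proficiency estimates for four examinees */
data stage1;
   input id theta1;
   datalines;
1 -1.8
2 -0.2
3 0.4
4 2.1
;
run;

/* Illustrative cut points (assumptions, not values from the paper) */
%let cut_low  = -1;
%let cut_high = 1;

/* Route each examinee to the second-stage module that matches */
/* the stage-1 proficiency estimate                            */
data routed;
   set stage1;
   length module $4;
   if theta1 < &cut_low then module = 'LOW';
   else if theta1 < &cut_high then module = 'MID';
   else module = 'HIGH';
run;

proc print data=routed noobs;
run;
```

In an operational design the cut points would presumably come from the classification standards or from the module information curves rather than being fixed by hand.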
[Figure 1 flowchart: Select the first stage test module -> Administer test -> Estimate proficiency -> End? If yes, report final proficiency; if no, select the next stage test module.]

Figure 1. MST procedure

SIMULATION

A simulation using a two-stage test design was conducted to demonstrate the preparation work for test development. The Rasch model is used in this simulation. It is the simplest item response theory (IRT) model for dichotomous items, having only one parameter for the examinee and one for the item, the latter generically referred to as a threshold. Mathematically, the Rasch model can be expressed as

$$P(u_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp(b_i - \theta_j)},$$

representing the probability of answering a particular dichotomously scored item correctly given the proficiency level of a test taker, where $b_i$ is the difficulty of item $i$ and $\theta_j$ is the ability of person $j$.

The steps of the simulation study are outlined as follows:

- Data preparation: Simulate true proficiency scores ($\theta_t$), item parameters ($b$), item responses ($u$), true group ID, a single stage module, and both first and second stage modules.
- Proficiency estimation: Use $u$ and $b$ from the first stage to estimate the proficiency score $\theta_1$. Assign second stage modules based on $\theta_1$. Combine data from both stages and estimate the final proficiency score $\theta_{12}$. Also estimate single stage proficiency scores $\theta_0$ for comparison.
- Evaluation: Calculate psychometric properties and related statistics for evaluation.

DATA PREPARATION

To simulate true proficiency scores for 3,000 test candidates, $\theta_t \sim N(0, 3)$ are generated from a normal distribution with predefined upper and lower bounds. We assume that the test will be used to classify the candidates into three groups: A, B, and C (e.g., pass advanced, pass, and fail). The candidates are divided into three groups of 1,000 each according to the true proficiency scores $\theta_t$. Three sets of item parameters $b$ are also generated from normal distributions with different means and bounds, and the three sets are combined to form a pool of 60 items. Then, the responses $u$ are generated from $\theta_t$ and $b$ using the Rasch model. One first stage module, which contains 30 items, and three second stage modules with 20 items each are assigned. A 50-item single-stage module is also assigned for comparison.

The first stage module is designed to contain a broad range of difficulty values, while each second stage module has more items with difficulty located near the average proficiency score of its respective group. Figures 2 and 3 illustrate the test information curves of the first and second stage modules, respectively. The test information is simply the sum over items of the item information. Namely,

$$I(\theta) = \sum_i I_i(\theta).$$

The item information function is defined as

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\, Q_i(\theta)},$$

where $Q_i = 1 - P_i$ and $P_i'(\theta)$ is the derivative of $P_i(\theta)$ with respect to $\theta$. For the Rasch model, the expression is simply

$$I_i(\theta) = P_i(\theta)\, Q_i(\theta).$$

[Figure 2 plot: test information (0.0 to 5.0) against theta (-5 to 5)]

Figure 2. Test information curve of the first stage module

[Figure 3 plot: test information (0.0 to 5.0) against theta (-5 to 5)]

Figure 3. Test information curves of the second stage modules
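As a rough illustration of how a test information curve like those in Figures 2 and 3 could be computed, the following PROC IML sketch evaluates the Rasch test information on a grid of ability values. The five item difficulties are hypothetical and are not the parameters used in the paper's modules.

```sas
proc iml;
   /* Hypothetical difficulties for a five-item module (assumption) */
   b = {-2, -1, 0, 1, 2};

   /* Ability grid covering the range shown in the figures */
   theta = do(-5, 5, 1);

   info = j(1, ncol(theta), 0);
   do k = 1 to ncol(theta);
      /* Rasch probability P_i(theta) for every item at this ability */
      p = 1 / (1 + exp(b - theta[k]));
      /* Test information: sum over items of P_i * Q_i */
      info[k] = sum(p # (1 - p));
   end;

   print (theta`)[label="theta"] (info`)[label="information"];
quit;
```

Because these difficulties are symmetric around zero, the printed information values should be largest at theta = 0 and fall off toward both ends of the grid. A module with difficulties concentrated in a narrow range would show a taller, narrower curve.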
PROFICIENCY ESTIMATION

The measure of the proficiency or ability of a given examinee is the maximum likelihood estimate based on the responses to the items and the values of the item parameters. For a test module with $N$ items, $u_i \in \{0, 1\}$ refers to a test taker's response to item $i$, which is scored dichotomously. Under the assumption of local independence, the probability of the vector of item responses $U = (u_1, u_2, \ldots, u_N)$ is given by the likelihood function

$$\Pr(U \mid \theta) = \prod_{i=1}^{N} P_i^{u_i} Q_i^{1 - u_i},$$

where $P_i$ is the Rasch function, $Q_i = 1 - P_i$, and $\theta$ is the ability of the test taker. The derivative of the log-likelihood function with respect to $\theta$ is

$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} u_i \frac{1}{P_i} \frac{\partial P_i}{\partial \theta} + \sum_{i=1}^{N} (1 - u_i) \frac{1}{Q_i} \frac{\partial Q_i}{\partial \theta},$$

where $L$ is the natural logarithm of the likelihood function $\Pr(U \mid \theta)$. For the Rasch model, we have

$$\frac{\partial P_i}{\partial \theta} = P_i Q_i \quad \text{and} \quad \frac{\partial Q_i}{\partial \theta} = -P_i Q_i.$$

Then

$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} (u_i - P_i) \quad \text{and} \quad \frac{\partial^2 L}{\partial \theta^2} = -\sum_{i=1}^{N} P_i Q_i.$$

Writing $L'$ and $L''$ for these first and second derivatives and using a Taylor series expansion to solve the likelihood equation, we have

$$L'(\theta) \approx L'(\theta_0) + L''(\theta_0)(\theta - \theta_0) = 0,$$

where $\theta_0$ can be viewed as the trial value $\theta_n$ for the root of $L'(\theta) = 0$ at the $n$th step. The approximate value at the next step, $\theta_{n+1}$, can be derived from

$$\theta_{n+1} = \theta_n + \frac{\sum_i (u_i - P_i)}{\sum_i P_i Q_i}.$$

This iterative scheme is known as the Newton-Raphson method, and the process must be repeated until the increment $|\theta_{n+1} - \theta_n|$ becomes sufficiently small.

A SAS/IML module that implements the Newton-Raphson method is used to simplify the task. The following statements show a macro that calls the IML module to estimate the ability of every test taker in an iteration loop. The initial trial values are all set to zero in the macro. To improve efficiency, they can instead be initialized from the total score or by other means.

    %macro rbtrasch(        /* Ability estimation */
       dsr=,                /* Item response      */
       dsp=,                /* Item parameter     */
       dst=                 /* Ability            */
       );

    proc iml;
       nmaxiter=30;         /* max iteration number  */
       mindelta=0.01;       /* convergence tolerance */
       ubtheta=5;           /* theta upper bound     */
       lbtheta=-5;          /* theta lower bound     */

       use &dsr; read all var _num_ into r;
       use &dsp; read all var _num_ into b;

       ntakers=nrow(r);
       nitems=ncol(r);
       nitems1=nrow(b);

       /* Error check: one item parameter per response column */
       if nitems^=nitems1 then do;
          print "ERROR: Inconsistent inputs";
          abort;            /* leave PROC IML instead of continuing */
       end;

       /* Newton-Raphson equation iteration loop */
       start rascht(t0,pb,r,mxit,ubt,lbt,mind);
          t=t0; nit=1; n=nrow(pb);
          do while(nit<=mxit);
             snum=0.0; sdem=0.0;
             do i=1 to n;
                p=1.0/(1.0+exp(pb[i]-t));   /* Rasch probability           */
                w=p*(1.0-p);                /* P*Q, minus second derivative */
                v=r[i]-p;                   /* u-P, first derivative term   */
                snum=snum+v;
                sdem=sdem+w;
             end;
             dta=snum/sdem;
             /* Check convergence: make this the last pass if the step is small */
             if abs(dta)<mind then nit=mxit;
             /* Update and keep theta within its bounds */
             t=t+dta;
             if t>ubt then t=ubt;
             else if t<lbt then t=lbt;
             nit=nit+1;
          end;
          return (t);
       finish rascht;

       /* Initial estimate */
       t0=j(ntakers,1,0);

       /* Loop through every test taker */
       theta=j(ntakers,1,0);
       do j=1 to ntakers;
          theta[j]=rascht(t0[j],b,r[j,],nmaxiter,ubtheta,lbtheta,mindelta);
       end;

       create &dst from theta[colname='theta'];
       append from theta;
       close &dst;
    quit;
    %mend rbtrasch;

EVALUATION

The correlation matrix of the true proficiency scores, the estimated scores from the two-stage (30 and 20 items) test, and the estimated scores from the single stage test is provided in Table 1, obtained with PROC CORR. The correlation between the true score and the two-stage proficiency estimates is higher than the correlation between the true score and the single-stage proficiency estimates.

Table 1. Correlation matrix of true scores, two-stage, and single stage proficiency estimates

                  True score    Two-stage
    Two-stage     0.85134
    Single-stage  0.7085        0.79453
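The following sketch shows one way the `%rbtrasch` macro might be called and its output correlated with true scores using PROC CORR. All data set names, variable names, and values here (`resp`, `ipar`, `truth`, `both`, `theta_true`, and the toy responses and difficulties) are hypothetical and chosen only to keep the example small; they are not the simulated data behind Table 1.

```sas
/* Hypothetical responses: 3 examinees by 4 items */
data resp;
   input u1-u4;
   datalines;
1 0 0 0
1 1 0 0
1 1 1 0
;
run;

/* Hypothetical item difficulties, one row per item */
data ipar;
   input b;
   datalines;
-1.0
-0.5
0.5
1.0
;
run;

/* Estimate ability with the macro; output data set EST holds THETA */
%rbtrasch(dsr=resp, dsp=ipar, dst=est);

/* Hypothetical true scores in the same examinee order */
data truth;
   input theta_true;
   datalines;
-1.2
0.1
1.3
;
run;

/* One-to-one merge (no BY statement) pairs rows by position */
data both;
   merge truth est;
run;

proc corr data=both;
   var theta_true theta;
run;
```

The toy response patterns avoid all-correct and all-incorrect rows, for which the maximum likelihood estimate does not exist and the estimate would simply sit at one of the theta bounds.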
For most credentialing testing programs the decision accuracy for classifying candidates is crucial. We assumed that groups A and B are collapsed as Pass and group C is Fail. Cohen's kappa coefficient, which provides a measure of agreement, can then be obtained from PROC FREQ with the TEST KAPPA option as

$$\kappa = \frac{P_o - P_c}{1 - P_c},$$

where $P_o$ is the observed agreement and $P_c$ is the chance agreement. The values of kappa range from -1 to +1. However, a negative kappa is unusual in practice because it means the observed agreement is less than the chance agreement. A number of studies (Sim and Wright, 2005; Viera and Garrett, 2005) show that other factors can influence the magnitude of kappa and suggest reporting additional indices, such as the prevalence index and the bias index, to provide a clearer picture. Both are included in Table 2, although low kappa and high prevalence are very rare in most well designed educational assessment programs. The decision accuracy of the two-stage case is slightly higher, but the difference is not significant in the simulation. The mean and standard deviation of the difference from the true score are also provided, even though accurate proficiency estimates are less critical for most credentialing tests. The results show that the two-stage design yields more accurate estimates.

Table 2. Cohen's kappa, prevalence index, and bias index

                        Cohen's kappa   Prevalence index   Bias index   Mean    Standard deviation
    True/Two-stage      0.8516          0.357              0.0077       0.341   1.0768
    True/Single-stage   0.8475          0.3357             0.003        0.486   1.657

CONCLUSION

This simulation study is a coding practice for the preliminary preparation work of a credentialing testing program in the coming years. MST is being considered and is expected to have some distinct advantages over conventional fixed-length testing. Many issues remain to be resolved in test development. To provide information for decision making, parameters and data will be adjusted as more information becomes available, such as data collected from field testing, item characteristics in the future item bank, and data derived from previous tests. This paper illustrates the procedure of using SAS to prepare information for decision making and outlines some steps for use in the development.

REFERENCES

Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44-52.

Jodoin, M. G., Zenisky, A., and Hambleton, R. K. (2006). Comparison of the psychometric properties of several computer-based test designs for credentialing exams with multiple purposes. Applied Measurement in Education, 19(3), 203-220.

Sim, J. and Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257-268.

Viera, A. J. and Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360-363.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Yung-chen Hsu
GED Testing Service, LLC
One Dupont Circle NW
Washington, DC 20003
Work Phone: 202-939-9717
E-mail: yung-chenhsu@gedtestingservice.com
Web: www.gedtestingservice.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.