Does Pre-equating Work? An Investigation into Pre-equated Testlet-based College Placement Exam Using Post-Administration Data

Wei He
Michigan State University

Rui Gao
Chunyi Ruan
ETS, Princeton, NJ

Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), April 14-16, 2009, San Diego, CA.

Unpublished Work Copyright 2009 by Educational Testing Service. All Rights Reserved. These materials are an unpublished, proprietary work of ETS. Any limited distribution shall not constitute publication. This work may not be reproduced or distributed to third parties without ETS's prior written consent. Submit all requests through

Educational Testing Service, ETS, the ETS logo, and Listening. Learning. Leading. are registered trademarks of Educational Testing Service (ETS).

Abstract

This study investigated whether pre-equating results agree with equating results based on operational data (post-equating) for a college placement program. We examined the degree to which the IRT true-score pre-equating results agreed with those from IRT true-score post-equating and with the results from observed-score equating. Three subjects were examined: Analyzing and Interpreting Literature (AIL), American Government (GOV), and College Algebra (ALG). The findings suggested that differences between IRT true-score pre-equating and post-equating results varied from subject to subject. In general, IRT true-score post-equating agreed with IRT true-score pre-equating for most of the test forms. Differences among the equating results may be attributable to the way items were pretested, contextual/order effects, or the violation of IRT assumptions.

Key words: test equating, IRT true-score pre-equating

Acknowledgments

The authors thank Neil Dorans for his comments and suggestions on previous versions of the paper. The authors thank Anthony Giunta for his help with additional computations.

Theoretical Framework

Equating is a statistical process used to adjust scores on two or more forms of a test so that the scores can be used interchangeably (Kolen & Brennan, 2004). Depending on when equating is conducted, it can be categorized as pre-equating or post-equating. Pre-equating, as the name implies, is the process through which conversions from raw to scaled scores are established before the new test is administered operationally. Pre-equating is often based on item response theory (IRT). Because pre-equating establishes the conversion table prior to operational testing, it offers several advantages over post-equating (see Eignor (1985), Kolen (2004), and Kirkpatrick and Way (2008) for a complete review). These advantages include more flexible assessment and a better quality-control check for the tests. Perhaps its most appealing feature is that it facilitates immediate score reporting for tests that must report scores right after the test administration. Because of these advantages, pre-equating is sometimes used in large-scale assessments.

Much research has compared equating results from pre-equating with those from post-equating, and the findings have been inconsistent. Eignor (1985), investigating the feasibility and practical outcomes of pre-equating the SAT verbal and mathematical sections using IRT true-score equating, found that results varied across subjects and forms: pre-equating worked adequately for the verbal sections but not for the mathematical section. Similarly, Kolen and Harris (1990), comparing item pre-equating and random groups equating using IRT and equipercentile methods, found that pre-equating performed poorly for the American College Test (ACT) math test. Explanations for poor pre-equating results have mainly focused on the inconsistent behavior of test items in pretest and operational contexts. These inconsistencies can be caused by a lack of motivation on pretest items, different ability distributions in the pre- and post-equating samples, changes in item position, context effects, violation of IRT assumptions, and model-data misfit (Eignor & Stocking, 1986; Stocking & Eignor, 1986; Kolen & Harris, 1990; Tong, et al., 2008).

However, other studies suggested that pre-equating can achieve satisfactory results. Bejar and Wingersky (1982) compared the pre-equating results of the Test of Standard Written English (TSWE) using different equating methods and concluded that pre-equating could be a feasible operational procedure. Livingston (1985), using a method similar to regression, demonstrated that pre-equating was highly accurate in three of the four New Jersey College Basic Skills Placement tests. The two most recent studies, by Domaleski (2006) and Tong, et al. (2008), supported the use of pre-equating: the pre- and post-equated scoring tables were similar, as was the accuracy of classifying students into different performance levels.

Beyond these mixed findings, a literature review indicates that little research has examined whether pre-equating agrees with post-equating for a testlet-based, computer-administered testing program. Moreover, given the conflicting views on pre-equating and the appealing features it offers, more research is clearly needed in this area. To this end, this study used real data to investigate whether the pre-equating results agree with the equating results based on operational data (post-equating). The study examined the degree to which the IRT true-score pre-equating results agreed with those from IRT true-score post-equating and with the results from observed-score equating methods.

Method

Introduction to the Testing Program

Exams for the college placement program are administered in computer-based testing (CBT) format. Testlets are the building blocks for the exams. A testlet is a collection of questions from a coherent content domain. Multiple versions of parallel testlets are created for each content domain; these are combined, one testlet from each content domain, to build unique forms that are parallel in content and statistical properties. For example, the Analyzing and Interpreting Literature (AIL) exam consists of three types of operational testlets (A, B, and C) from three different content domains. Each type of testlet has two parallel versions (A1 and A2, B1 and B2, C1 and C2). Combining one testlet from each content domain therefore yields 8 parallel forms, as indicated in Table 1.
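The form-assembly rule just described (one testlet from each content domain) can be illustrated with a short sketch. The testlet labels come from the AIL example; the variable names are hypothetical and are not part of the testing program's software.

```python
from itertools import product

# Two parallel versions per content domain, as in the AIL example.
# (Variable names are illustrative only.)
testlet_versions = {
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2"],
}

# A form takes exactly one testlet from each domain: 2 * 2 * 2 = 8 forms.
forms = ["".join(combo) for combo in product(*testlet_versions.values())]

for number, form in enumerate(forms, start=1):
    print(f"Form {number}: {form}")
```

Any two forms built this way share between zero and two testlets, which is why the forms overlap with one another to varying degrees.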

Insert Table 1 about here

Pretest items are embedded in operational testlets. The testlets and pretest items assembled for each administration are called a package. Each package is administered continuously in the field for approximately 3 years. New packages are assembled by replacing a portion of the items in the old package with pretest items once enough data have accumulated on them (500 responses for each pretest item). For AIL, the 2001 package was administered from 2001 to 2004. The pretest items accumulated data during those 3 years. The 2004 package was assembled by replacing operational items in the 2001 package with pretest items. The 8 test forms in an AIL package overlap with one another to varying degrees at the testlet level, as shown in Table 1. The computerized delivery software assigns a test form at random to each test-taker.

Test scores on different forms of the 2004 package are equated to a common reference form, A2B2C2 in the 2001 package, to adjust for form differences. The Rasch model is used for item calibration. IRT true-score pre-equating is used operationally to report scaled scores, which typically range from 20 to 80. To derive a raw-to-scale conversion table for each form of a new package, the following procedure is generally followed:

Number-Correct Score_new → θ_new → θ_reference → Number-Correct Score_reference → Scaled Score

Specifically, for a particular new form, the observed number-correct scores on the form are treated as expected IRT true scores, which are then converted to the ability scale (θ_new) based on the Rasch model. The Stocking and Lord (SL) (1983) transformation method is used to place all parameter estimates from separate calibrations on the same metric. Therefore, the ability scores on the testlet-based new forms are on the same scale as the ability scores on the common reference form. Using the test characteristic curve for the common reference form, the ability scores (θ_reference) on the reference form are converted to expected IRT true scores, which are then treated as if they were reference-form number-correct scores. Finally, using a linear conversion associated with the reference form, these raw scores on the reference-form scale are placed onto the 20-to-80 score scale.

Data

To compare the equating results from pre- and post-equating (i.e., equating based on the post-administration data for the different forms in the 2004 packages), data from two different packages were used in this study: 1) the 2001 package, containing the pretesting data collected from 2001 to 2004; and 2) the 2004 package, containing the operational data collected from 2004 to 2008 for those pretest items in the 2001 packages and other operational items. Three subjects were used in this study: Analyzing and Interpreting Literature (AIL), American Government (GOV), and College Algebra (ALG). For AIL, 8 forms were used, each with data from about 3,500 examinees; for GOV, 7 forms were used, each with about 1,700 examinees; for ALG, 8 forms were used, each with about 850 examinees.

Equating Design and Equating Methods

Pre-equating. IRT true-score equating was used for pre-equating. To pre-equate the forms in the 2004 packages, the response data collected between 2001 and 2004 for only the operational items in the 2001 package were calibrated first. Then, the response data for the whole package, including both operational and pretest items, were calibrated. Finally, the pretest items were put on the same scale as the operational items using the SL method by fixing the parameters for the operational items. The item parameter estimates from this step were then used to create the raw-to-scale conversion table for each form to the reference form using IRT true-score equating.

Post-equating. IRT true-score equating and observed-score equating methods were used for post-equating. To conduct the IRT true-score post-equating, the response data collected between 2004 and 2008 on the operational items in the 2004 package were first calibrated. Then, the operational items were put on the scale defined by the 2001 package using the SL method by fixing the parameters for the operational items common to both the 2001 and 2004 packages. The item parameter estimates from this step were used to create the conversion table for the IRT true-score post-equating.

The observed-score equating methods for post-equating used either the equivalent groups design without anchor items (EG) or the non-equivalent groups with anchor test design (NEAT). For the EG design, because examinees were administered a randomly selected test form, it was reasonable to assume that examinees who took tests from different packages were equivalent. For the NEAT design, as explained above, different test forms shared certain common items. To choose the observed-score equating method, the equivalence of the two groups used in the equating was examined first. If the two groups were found to be equivalent, either equipercentile or mean/sigma linear equating was used, depending on the characteristics of the data distribution. If the two groups were found to be nonequivalent, the chained equipercentile or Tucker equating method was used. Generally, if the difference between the linear and equipercentile equating results was within the range specified by the difference that matters (DTM), the linear method was adopted as the observed-score equating method; otherwise, equipercentile equating was used. Graphs comparing the linear and equipercentile equating results, with the linear methods as the baseline, are included in Appendix B.
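The IRT true-score conversion chain described under Introduction to the Testing Program (number-correct score, to θ, to reference-form true score, to scaled score) can be sketched under the Rasch model as follows. This is a minimal illustration, not the operational procedure: the item difficulties, the slope and intercept of the linear scaling, and all function names are hypothetical, and the sketch assumes the two forms' difficulties are already on a common metric (the role played by the SL transformation in the study).

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def true_score(theta, difficulties):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(rasch_prob(theta, b) for b in difficulties)

def theta_for_score(score, difficulties, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert the test characteristic curve by bisection.

    Only scores strictly between 0 and the number of items have a finite
    theta; zero and perfect scores need separate handling (not shown).
    """
    if not 0 < score < len(difficulties):
        raise ValueError("score must lie strictly between 0 and the number of items")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def pre_equate(score, new_b, ref_b, slope, intercept):
    """New-form raw score -> theta -> reference-form true score -> scaled score."""
    theta = theta_for_score(score, new_b)      # treat raw score as a true score
    ref_score = true_score(theta, ref_b)       # reference-form equivalent
    return slope * ref_score + intercept       # hypothetical linear scaling

# Hypothetical difficulties: the new form is harder than the reference form.
new_form_b = [0.0, 1.0, 2.0]
ref_form_b = [-1.0, 0.0, 1.0]
for raw in (1, 2):
    scaled = pre_equate(raw, new_form_b, ref_form_b, slope=20.0, intercept=20.0)
    print(raw, round(scaled, 2))
```

Because the hypothetical new form is harder, a given raw score on it maps to a higher reference-form equivalent, which is the adjustment for form difficulty that equating is meant to provide.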
Evaluation Criteria

The results from the IRT true-score pre-equating method were used as the reference. The conversion lines from the different post-equating methods were compared against the conversion line from the IRT true-score pre-equating. The notion of the DTM (Dorans and Feigenbaum, 1994; Dorans, Holland, Thayer, and Tateneni, 2003) was adopted to evaluate the magnitude of the difference in equated scores between the pre-equating and post-equating methods. A difference of 0.5 was considered significant because it means a change in the reported score. Pass/fail classification rates given by the different equating methods were also reported. Each test has two cut scores, the C and B cuts. Classification rates for both the C and B cuts were reported for each of the three tests in the study.

In addition, three other indices were employed: mean signed difference (MSD) (Eq. 1), root mean square difference (RMSD) (Eq. 2), and mean absolute difference (MAD) (Eq. 3). All three indices were weighted by the frequency at each number-correct raw score level. According to Kolen and Harris (1990), the MSD and MAD indices are measures of the mean difference between converted scores, while the RMSD is a measure of the similarity of conversion tables.

$$\mathrm{MSD} = \frac{\sum_i f_i \,(X_i - X'_i)}{\sum_i f_i} \qquad \text{(Eq. 1)}$$

$$\mathrm{RMSD} = \sqrt{\frac{\sum_i f_i \,(X_i - X'_i)^2}{\sum_i f_i}} \qquad \text{(Eq. 2)}$$

$$\mathrm{MAD} = \frac{\sum_i f_i \,\lvert X_i - X'_i \rvert}{\sum_i f_i} \qquad \text{(Eq. 3)}$$

where the sums run over each number-correct raw score level $i$, $f_i$ is the frequency at raw score level $i$, $X_i$ is the equated score from IRT true-score pre-equating at raw score level $i$, and $X'_i$ is the equated score at raw score level $i$ from the post-equating method being compared.

Results

AIL

For all eight AIL forms, the equipercentile equating method using the equivalent groups design was selected as the observed-score equating method for post-equating.

Figures 1 to 8 portray the scaled score differences among IRT true-score pre-equating, IRT true-score post-equating, and equipercentile equating for all eight forms at each number-correct raw score level, with the IRT true-score pre-equating results as the baseline. Approximately 50 examinees at the top and bottom of the score scale were excluded from the plots for a clearer picture of the score patterns. In the plots, the number-correct raw score corresponding to the C cut for each form is indicated by the vertical dotted line, and the number-correct raw score corresponding to the B cut is indicated by the vertical solid line. The DTM band is indicated by the two horizontal dotted lines.

For all forms, below the C cut, the IRT true-score pre-equating consistently yielded scaled scores much higher than those from the two post-equating methods, followed by IRT true-score post-equating and then the equipercentile method. With the IRT true-score pre-equating method, examinees could gain up to 3 more scaled score points than with the equipercentile method. This difference suggests that IRT true-score pre-equating made the test appear harder than it actually was for examinees whose raw scores were below the C cut.

At number-correct raw scores above the C cut, the difference between the scaled scores from the IRT true-score pre-equating and the two post-equating methods became smaller, and the scaled scores from the IRT true-score pre-equating became lower than those from the post-equating methods. However, the difference fell within the DTM band except for Forms 1 and 2 for the equipercentile method.
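The frequency-weighted indices in Eq. 1 to Eq. 3 and the DTM check used to read the figures can be sketched as follows. The function names and the example numbers are hypothetical, and the sign convention (baseline minus comparison) is an assumption about how the signed difference is oriented.

```python
import math

DTM = 0.5  # difference that matters, in scaled-score units

def weighted_indices(freqs, baseline, comparison):
    """Return (MSD, RMSD, MAD), weighted by raw-score frequencies.

    baseline   : equated scores from IRT true-score pre-equating
    comparison : equated scores from a post-equating method
    The difference is taken as baseline minus comparison (assumed).
    """
    total = sum(freqs)
    diffs = [x - xp for x, xp in zip(baseline, comparison)]
    msd = sum(f * d for f, d in zip(freqs, diffs)) / total
    rmsd = math.sqrt(sum(f * d * d for f, d in zip(freqs, diffs)) / total)
    mad = sum(f * abs(d) for f, d in zip(freqs, diffs)) / total
    return msd, rmsd, mad

def levels_outside_dtm(baseline, comparison, dtm=DTM):
    """Indices of raw-score levels whose absolute difference reaches the DTM."""
    return [i for i, (x, xp) in enumerate(zip(baseline, comparison))
            if abs(x - xp) >= dtm]

# Made-up numbers for three raw-score levels, for illustration only.
freqs = [1, 2, 1]
pre_scores = [50, 55, 60]
post_scores = [49, 55, 61]

msd, rmsd, mad = weighted_indices(freqs, pre_scores, post_scores)
print(msd, rmsd, mad)
print(levels_outside_dtm(pre_scores, post_scores))
```

In this made-up example the signed differences cancel, so the MSD is zero even though the RMSD and MAD are not, which is why the three indices are reported together.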
Insert Figures 1-8 about here

Table 2 shows that, for the classification rate, the IRT true-score pre-equating tended to pass more examinees in total than the two post-equating methods. The equipercentile equating (Obs.) tended to pass the fewest examinees.

Insert Table 2 about here

Table 3 reports the mean and standard deviation of the equated scores from the different equating methods for AIL. The IRT true-score pre-equating consistently tended to yield a higher average scaled score and a lower standard deviation than the two post-equating methods.

Insert Table 3 about here

Table 4 presents the three indices used to evaluate the equating results, with the IRT true-score pre-equating results as the baseline. All three indices indicated that the IRT true-score post-equating was closer to the IRT true-score pre-equating than the observed-score equating was, with smaller RMSD, MSD, and MAD in all forms except Forms 7 and 8.

Insert Table 4 about here

American Government

For all 7 GOV forms, the mean-sigma linear equating method using the equivalent groups design (MS_EG) was selected as the observed-score equating method for post-equating.

Insert Figures 9-15 about here

Figures 9 to 15 portray the scaled score differences among the IRT true-score pre-equating, the IRT true-score post-equating, and the observed-score equating for all 7 forms at each number-correct raw score level, with the IRT true-score pre-equating results as the baseline. Approximately 50 examinees at the top and bottom of the scaled score range were excluded from the plots for a clearer picture of the score patterns.

Interestingly, the IRT true-score pre-equating gave slightly lower scaled scores than the IRT true-score post-equating method in four forms: Forms 1, 2, 4, and 6. The differences between the two equating methods slightly exceeded the DTM band, suggesting that the IRT true-score pre-equating tended to make the test appear slightly easier than the IRT true-score post-equating method did. For Forms 3, 5, and 7, there was little difference between the scaled scores from IRT true-score pre-equating and IRT true-score post-equating.

At a glance, the graphs show that the IRT true-score pre-equating generally gave higher scaled scores than the observed-score equating method, making the test appear slightly harder, except for Form 1. A closer look found that the pattern of differences between the scaled scores from the two equating methods was inconsistent across forms. For Form 1, there was little difference. For Forms 2 and 3, the IRT true-score pre-equating yielded higher scaled scores than the observed-score equating at most raw score levels, but at the high raw score levels it gradually yielded lower scaled scores. For Forms 4 and 5, the IRT true-score pre-equating yielded lower scaled scores at raw score levels below the C cut, but gradually yielded higher scaled scores at raw score levels above the C cut. For Forms 6 and 7, the IRT true-score pre-equating consistently yielded higher scaled scores than the observed-score equating.

Table 5 shows that, for the classification rate, the IRT true-score pre-equating method tended to pass fewer examinees than the IRT true-score post-equating method at both the B and C cuts and in total. The IRT true-score pre-equating and the observed-score equating methods yielded roughly the same passing rates except for Form 4, where the observed-score equating yielded a lower passing rate in total and at the C cut.

Insert Table 5 about here

Table 6 reports the mean and standard deviation of the equated scores from the different equating methods for GOV. The IRT true-score pre-equating yielded lower average scaled scores than the IRT true-score post-equating method on Forms 1, 2, 4, and 6 and slightly higher average scaled scores on Forms 3, 5, and 7. The observed-score equating consistently tended to yield the lowest average scaled scores across all forms.

Insert Table 6 about here

Table 7 presents the three indices used to evaluate the equating results, with the IRT true-score pre-equating results as the baseline. All three indices indicated that, for Forms 3, 5, and 7, the IRT true-score post-equating yielded results close to the IRT true-score pre-equating, with RMSD, MSD, and MAD smaller than .044, as highlighted in Table 7. For Forms 1, 2, 4, and 6, the indices indicate that the results from the IRT true-score pre-equating differ from those of the two post-equating methods in different directions: the observed-score equating yielded lower scaled scores except for Form 1, while the IRT true-score post-equating yielded higher scaled scores.

Insert Table 7 about here

Algebra

For ALG, the NEAT design was considered more appropriate for the observed-score equating because of the small sample size, except for Forms 3 and 8, which did not share any common items with the reference form. For Forms 1, 2, and 5, linear equating using the Tucker method (Tucker) was used; for Forms 4, 6, and 7, the chained equipercentile method (Eq%_NEAT) was used; for Forms 3 and 8, the equipercentile method for equivalent groups was used.

Figures 16 to 23 portray the differences in the scaled scores given by IRT true-score pre-equating and the two post-equating methods. Approximately 50 examinees at the top and bottom of the scaled score range were excluded from the plots for a clearer picture of the score patterns.

Insert Figures 16-23 about here

The IRT true-score pre-equating generally yielded lower scaled scores than the two post-equating methods except for Form 6, suggesting that pre-equating tended to make the test appear easier than it actually was. The difference between the results from the IRT true-score pre-equating and IRT true-score post-equating was smaller than that between the IRT true-score pre-equating and the observed-score equating.

Except for Form 1, the differences between the scaled scores from the IRT true-score pre-equating and IRT true-score post-equating were roughly within the DTM at number-correct raw scores above the C cut. The differences between the scaled scores from the IRT true-score pre-equating and the observed-score equating were within the DTM for most number-correct raw scores above the C cut for Forms 2, 5, and 6, but beyond the DTM for Forms 1, 3, 4, 7, and 8. The differences were most evident in Forms 3, 7, and 8.

In terms of classification rate, Table 8 indicates that the IRT true-score pre-equating passed fewer examinees on Forms 3, 5, 7, and 8 than the two post-equating methods. The passing rates were the same across the three methods for the remaining four forms.

Insert Table 8 about here

Table 9 reports the mean and standard deviation of the equated scores from the different equating methods for ALG. The IRT true-score pre-equating tended to yield the lowest average scaled scores and higher standard deviations; the observed-score equating method tended to yield the highest scores and lower standard deviations.

Insert Table 9 about here

Table 10 presents the three indices used to evaluate the equating results, with the IRT true-score pre-equating results as the baseline. All three indices indicated that the IRT true-score post-equating was closer to the IRT true-score pre-equating than the observed-score equating was, with smaller RMSD, MSD, and MAD in all forms except Form 2.

Insert Table 10 about here

Conclusion and Discussion

For AIL, the higher scaled scores from IRT true-score pre-equating relative to the two post-equating methods can be partly attributed to how the pretest items were administered, which we call the scrolling effect. When items were pretested, they often appeared in a longer set than they did operationally. Because the test is administered on a computer, examinees may have had more difficulty moving back and forth on the screen to find the information needed to answer the items. As a result, the items might have appeared harder at the pretest stage than at the operational stage. For examinees with higher scores, however, performance was less likely to be affected by the scrolling effect, and their scaled scores from the different equating methods differed only slightly. The differences in most forms should be of little concern because they were within the band allowed by the DTM.

The results for GOV are not as consistent as those for AIL. The scaled scores from IRT true-score pre-equating are nearly the same as those from IRT true-score post-equating on 3 of the forms, but lower than those from IRT true-score post-equating on the other 4 forms. An examination of the items in each form found that the former 3 forms share the same testlet of 31 items, and the latter 4 forms share a different testlet of 31 items. Interestingly, about half of the items in the latter testlet appear slightly easier at the pretest stage than at the operational stage. This change in difficulty might contribute to the lower scaled scores from the IRT true-score pre-equating. We suspect that the change in difficulty may be related to contextual/order effects; the reasons are still under investigation.

For ALG, the lower scaled scores from IRT true-score pre-equating relative to the two post-equating methods can be partly attributed to the speededness of the test. Responses from examinees who were not able to reach the pretest items were excluded from the item calibration. Because the excluded examinees were usually those of lower ability, a calibration based on data from higher-ability examinees would make the items appear easier, and the scaled scores lower, at the pretest stage.

In general, the results from the IRT true-score post-equating are closer to the IRT true-score pre-equating than are those from the observed-score equating. The difference between IRT true-score equating and observed-score equating may be due to the violation of IRT assumptions.

Research Implication

Among the advantages of pre-equating, the most appealing may be its ability to allow timely or even immediate score reporting. To report valid scores, an adequate level of consistency in item parameter estimates should be achieved between the pretest and operational stages. Because findings on the use of pre-equating conflict, more studies are needed to provide validation information on the accuracy of pre-equating under different conditions and to suggest the equating and pretesting designs that allow pre-equating to perform at its best.

References

Bejar, I. I., & Wingersky, M. S. (1982). A study of pre-equating based on item response theory. Applied Psychological Measurement, 6(3),

Domaleski, C. (2006). Exploring the efficacy of pre-equating a large scale criterion-referenced assessment with respect to measurement equivalence. Unpublished doctoral dissertation, Georgia State University.

Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT. In I. M. Lawrence, N. J. Dorans, M. D. Feigenbaum, N. J. Feryok, A. P. Schmitt, & N. K. Wright (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (RM-94-10). Princeton, NJ: Educational Testing Service.

Dorans, N. J., Holland, P. W., Thayer, D. T., & Tateneni, K. (2003). Invariance of score linking across gender groups for three Advanced Placement Program exams. In N. J. Dorans (Ed.), Population invariance of score linking: Theory and applications to Advanced Placement Program examinations (ETS RR-03-27, pp ). Princeton, NJ: Educational Testing Service.

Eignor, D. R. (1985). An investigation of the feasibility and practical outcomes of pre-equating the SAT verbal and mathematical sections (Research Report 85-10). Princeton, NJ: Educational Testing Service.

Eignor, D. R., & Stocking, M. L. (1986). An investigation of possible causes for the inadequacy of IRT true-score pre-equating (Research Report 86-14). Princeton, NJ: Educational Testing Service.

Kolen, M. J., & Brennan, R. L. (2004). Test Equating: Methods and Practice. New York: Springer-Verlag.

Kolen, M. J., & Harris, D. J. (1990). Comparison of item preequating and random groups equating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1),

Livingston, S. A. (1985). Large-sample pre-equating: how accurate? Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Stocking, M. L., & Eignor, D. R. (1986). The impact of different ability distributions on IRT preequating (RR 86-49). Princeton, NJ: Educational Testing Service.

Tong, Y., Wu, S.-S., & Xu, M. (2008). A comparison of pre-equating and post-equating using large-scale assessment data. Paper presented at the American Educational Research Association Annual Meeting, New York City.

Appendix A
Tables and Figures

Table 1
Component Testlets for the 8 Analyzing and Interpreting Literature Exam Forms

Form 1: A1 B1 C1     Form 5: A2 B1 C1
Form 2: A1 B1 C2     Form 6: A2 B1 C2
Form 3: A1 B2 C1     Form 7: A2 B2 C1
Form 4: A1 B2 C2     Form 8: A2 B2 C2

Table 2
Classification Rate for Total Pass Rate, C-pass Rate, and B-pass Rate for AIL
Columns: Form | Equating Method | N | Total Pass N | Total % Pass | C-pass N | % C-Pass | B-pass N | % B-Pass
Rows: Forms 1 to 8, by equating method (IRT Pre, IRT Post)

Table 3
Mean and Standard Deviation of Equated Scores from Different Equating Methods for AIL
Columns: Form | Mean (IRT Pre, Obs., IRT Post) | Standard Deviation (IRT Pre, Obs., IRT Post)
Rows: Forms 1 to 8

Table 4
RMSD, MSD, and MAD for AIL
Columns: Form | RMSD (Obs., IRT Post) | MSD (Obs., IRT Post) | MAD (Obs., IRT Post)
Rows: Forms 1 to 8

Table 5
Classification Rate for Total Pass Rate, C-pass Rate, and B-pass Rate for GOV
Columns: Form | Equating Method | N | Total Pass N | Total % Pass | C-pass N | % C-Pass | B-pass N | % B-Pass
Rows: Forms 1 to 7, by equating method (IRT Pre, MS_EG, IRT Post)

Table 6
Mean and Standard Deviation of Equated Scores from Different Equating Methods for GOV
Columns: Form | Mean (IRT Pre, Obs., IRT Post) | Standard Deviation (IRT Pre, Obs., IRT Post)
Rows: Forms 1 to 7

Table 7
RMSD, MSD, and MAD for GOV
Columns: Form | RMSD (Obs., IRT Post) | MSD (Obs., IRT Post) | MAD (Obs., IRT Post)
Rows: Forms 1 to 7

Table 8
Classification Rate for Total Pass Rate, C-pass Rate, and B-pass Rate for ALG
Columns: Form | Equating Method | N | Total Pass N | Total % Pass | C-pass N | % C-Pass | B-pass N | % B-Pass
Rows: Forms 1 to 8, by equating method (IRT Pre, IRT Post, and the observed-score method: Tucker for Forms 1, 2, and 5; Eq%_NEAT for Forms 4, 6, and 7)

Table 9
Mean and Standard Deviation of Equated Scores from Different Equating Methods for ALG
Columns: Form | Mean (IRT Pre, Obs., IRT Post) | Standard Deviation (IRT Pre, Obs., IRT Post)
Rows: Forms 1 to 8

Table 10
RMSD, MSD, and MAD for ALG
Columns: Form | RMSD (Obs., IRT Post) | MSD (Obs., IRT Post) | MAD (Obs., IRT Post)
Rows: Forms 1 to 8

Figure 1. Scaled score difference for AIL Form 1 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 2. Scaled score difference for AIL Form 2 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 3. Scaled score difference for AIL Form 3 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 4. Scaled score difference for AIL Form 4 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 5. Scaled score difference for AIL Form 5 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 6. Scaled score difference for AIL Form 6 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 7. Scaled score difference for AIL Form 7 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 8. Scaled score difference for AIL Form 8 (legend: IRT pre-equating, IRT post-equating; horizontal axis: raw score)
Figure 9. Scaled score difference for GOV Form 1 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 10. Scaled score difference for GOV Form 2 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 11. Scaled score difference for GOV Form 3 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 12. Scaled score difference for GOV Form 4 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 13. Scaled score difference for GOV Form 5 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 14. Scaled score difference for GOV Form 6 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 15. Scaled score difference for GOV Form 7 (legend: IRT pre-equating, MS_EG, IRT post-equating)
Figure 16. Scaled score difference for ALG Form 1 (legend: IRT pre-equating, Tucker, IRT post-equating)
Figure 17. Scaled score difference for ALG Form 2 (legend: IRT pre-equating, Tucker, IRT post-equating)
Figure 18. Scaled score difference for ALG Form 3 (legend: IRT pre-equating, IRT post-equating)
Figure 19. Scaled score difference for ALG Form 4 (legend: IRT pre-equating, Eq%_NEAT, IRT post-equating)
Figure 20. Scaled score difference for ALG Form 5 (legend: IRT pre-equating, Tucker, IRT post-equating)
Figure 21. Scaled score difference for ALG Form 6 (legend: IRT pre-equating, Eq%_NEAT, IRT post-equating)
Figure 22. Scaled score difference for ALG Form 7 (legend: IRT pre-equating, Eq%_NEAT, IRT post-equating)
Figure 23. Scaled score difference for ALG Form 8 (legend: IRT pre-equating, IRT post-equating)

Appendix B
Results Comparison between Linear and Equipercentile Equating Methods

AIL Forms 1 to 8: one plot per form (legend: MS_EG; horizontal axis: raw score)
GOV Forms 1 to 7: one plot per form (legend: MS_EG)
ALG Forms: one plot per form (legend: Tucker and Eq%_NEAT, or MS_EG)
