Inter-Rater Agreement in Medical Studies: Analysis Tools within the SAS System

Inter-Rater Agreement in Medical Studies: Analysis Tools within the SAS System Dipl.Ing. Dr. Barbara Schneider, University of Vienna, Austria Abstract In this presentation the intraclass correlation coefficient, a graphical technique to check agreement visually and the κ-measure are discussed. Emphasis is laid on how these statistical methods and techniques can be performed within the SAS System and if they are not available within SAS Software how they can be implemented. Data from clinical trials are presented. SAS code is reported using the following tools: PROC GLM, PROC GPLOT, PROC FREQ, Macro Language and SAS/IML. Introduction Measurement in medical studies is very often an imprecise task and the errors associated with this should be quantified and understood. Since measurement error can seriously affect statistical analyses and interpretation, it is important to assess the amount of such error by calculating a riability index. To investigate agreement in continuous variables a graphical technique proposed by Bland and Altman (1986) and the intraclass correlation coefficient are used. For categorial data the κ-measure is appropriate. Studies comparing two or more methods are common. The aim of these studies is to see if the methods agree well enough for one method to replace the others or for the methods to be used interchangeably. The same considerations apply to studies comparing two or more observers using one method. An important aspect of method or observer comparison is the comparison of the repeatability of each observer or method. Graphical Technique The inter-observer (within observer or between methods) agreement is demonstrated by a graphical technique suggested by Bland and Altman (Lancet, Feb. 8, 1989).

In order to assess inter-observer agreement graphically, the difference between the two measurements of each of the two observers are plotted against the mean of the two values of the observers for each patient. Limits of agreement, defined as twice the standard deviation of the difference between the observers, are calculated and plotted in the figure. If we suppose that these differences follow a normal distribution, 95% of the differences will lie between these limits. To demonstrate the procedure, data from Altman (1991) are used comparing two measurement methods. Measurements of transmitral volumetric flow (MF) by Doppler echocardiography and left ventricular stroke volume (SV) by crosssectional echocardiography in 21 patients without aortic valve disease are shown. The following program implements this technique within the SAS System and produces this plot: DIFF 20 10 0-10 -20 40 50 60 70 80 90 100 110 120 130 140 MEAN

goptions reset=(axis, legend, pattern, symbol, title, footnote) norotate hpos=0 vpos=0 htext= ftext= ctext= target= gaccess= gsfmode= ; title; data ms; input patient mf sv; cards; 1 47 43 2 66 70 3 68 72 4 69 81 5 70 60 6 70 67 7 73 72 8 75 72 9 79 92 10 81 76 11 85 85 12 87 82 13 87 90 14 87 96 15 90 82 16 100 100 17 104 94 18 105 98 19 112 108 20 120 131 21 132 131 ; proc print data=o; data o; set o; d1=m_diff+2*s_diff; d2=m_diff-2*s_diff; call symput("m1",m_diff); call symput("d1",d1); call symput("d2",d2); symbol1 c=black v=circle i=none l=1; symbol2 c=black v=none i=join l=1; proc gplot data=temp; plot diff*mean=1 r*mean=2 / overlay vref= &m1 &d1 &d2 lvref=2 cvref=black caxis=black vaxis=&l to &u by &s ; title; %mend p; %p(mf,sv,-20,20,10); quit; %macro p(x,y,l,u,s); data temp; set ms; mean=mean(&x,&y); diff=&y-&x; r=0; /*in order to show perfect agreement: diff=0 */ proc univariate data=temp normal; var mean diff ; output out=o mean= m_mean m_diff std= s_mean s_diff ;

Intraclass correlation coefficient To quantify agreement the intraclass correlation coefficient (ICC) can be used for continuous variables. Roughly speaking the ICC can be viewed as a ratio of the variance of interest over the sum of the variance of interest plus error. Analysis of variance models are applied to calculate the ICC. To estimate the repeatability of each method or observer a one-way random effect model is performed. The resulting variance components are taken to calculate the ICC. Let x ij denote the ith rating (repetition) (i=1,...,r) on the jth target (e.g. patient) (j=1,...,n). Then the following linear model is assumed : x ij = µ + b j + e ij. In this equation, the component µ is the overall population mean of the ratings. b j is the difference from µ of the mean across the repeated jth target ratings. e ij represents the residual component. The component b j is assumed to vary normally with a mean of zero and a variance of σ T 2. It is also assumed that the e ij terms are distributed independently and normally with a mean of zero and a variance of σ E 2. The ICC takes the form of the following ratio: ICC = σ T 2 / (σ T 2 + σ E 2 ) To estimate the ICC variance components have to be estimated. PROC VARCOMP within the SAS System computes the necessary variance components but unfortunately does not provide an output data set. Therefore PROC GLM has to be used. The next program shows how to produce an estimate of the ICC within the SAS System.

data msu ; set libname.name; %macro i(t,y); data f ; keep t y; set msu; y=&y; /* the variable containing the scores of interest is mapped to an internal variable y */ t= &t; /* the variable containing targets (random effect ) information is mapped to t*/ proc glm data=f outstat=s1 noprint; class t ; model y= t ; random t ; data s1; set s1 ; mss=ss/df; if _TYPE_="SS3" or _TYPE_="ERROR"; proc transpose data=s1 out=o1 prefix=ms_; var mss; id _SOURCE_; proc transpose data=s1 out=o2 prefix=df_; var df; id _SOURCE_; data icc1; merge o1 o2; r_1=(df_error)/(df_t+1); r=r_1+1; vare=ms_error; vart=(ms_t-vare)/r; ICC=vart/(vart+vare); title2 "estimation of the ICC - scoring variable : &y"; proc print data=icc1 noobs; var icc; title2; %mend i; %i(tname,varname); Investigating the agreement between two or more observers or methods a twoway random effects model is appropriate. Let x ijl denote the ith rating (repetition) (i=1,...,r) on the jth target (e.g. patient) (j=1,...,n) of the lth observer (method) (l=1,...,k). Then the following linear model is assumed : x ijl = µ + a l + b j +(ab) lj + e ijl. In this equation, the component µ is the overall population mean of the ratings. a l is the difference from µ of the mean of the lth judge s ratings. b j is defined as before. (ab) lj is the interaction term between judges and targets. e ijl represents the error term. The component a l is assumed to vary normally with a mean of zero

and a variance of σ 2 J. b j is assumed to vary normally with a mean of zero and a variance of σ 2 T. (ab) lj are assumed to be mutually independent with a mean of zero and a variance of σ 2 I It is also assumed that e ijl are distributed independently and normally with a mean of zero and a variance of σ 2 E. Then the ICC takes the form of the following ratio: ICC = σ T 2 / (σ T 2 +σ J 2 + σ I 2 + σ E 2 ) In the absence of repeated ratings by each judge on each target, the interaction and error components cannot be estimated separately. This situation will be taken into account using the program statement if ms_error=. then ms_error=0; The following program shows how to produce an estimate of the ICC within the SAS System : data msu ; set libname.name; %macro i(t,j,y); data f ; keep t j y; set msu; y=&y; t=&t; j=&j; /* variable containing judge information */ proc glm data=f outstat=s2 ; class t j; model y= j t t*j; random t j t*j ; data s2; set s2 ; mss=ss/df; if _TYPE_="SS3" or _TYPE_="ERROR"; proc transpose data=s2 out=o3 prefix=ms_; var mss; id _SOURCE_; id _SOURCE_; data icc2; merge o3 o4; n=df_t+1; k_1=df_j; k=df_j+1; r=(1+df_error/(n*k)); if ms_error=. then ms_error=0; /* formal reasons*/ vare=ms_error; vari=(ms_t_j-vare)/r; varj=(ms_j-ms_t_j)/(r*n); vart=(ms_t-ms_t_j)/(r*k); icc=vart/(vart+varj+vari+vare); title2 estimation of the ICC - scoring variable : &y ; proc print data=icc2 noobs; title2; %mend i; %i(tname,jname,varname); proc transpose data=s2 out=o4 prefix=df_; var df;

Example As before data from measurements of transmitral volumetric flow (MF) by Doppler echocardiography and left ventricular stroke volume (SV) by crosssectional echocardiography in 21 patients without aortic valve disease are taken to estimate the ICC. This is a method comparison analysis. For each patient (target) and per method (judge) only one measurement is taken. In this situation the interaction and the error component cannot be estimated separately as stated above. The dataset ms is multivariate structured. To make the information stored in ms available for computing the ICC it has to be reshaped: data msu ; keep patient methode value; set ms; methode=1; value=mf; output; methode=2; value=sv; output; Running the macro: %i(patient,methode,value); produces the output: estimation of the ICC - scoring variable : value ICC 0.94621 κ-measure

Agreement between categorial assessments is usually considered as a problem of comparing the ability of different raters (observers, judges) to classify subjects (targets) into one of several groups. The investgations outlined below will also apply to method comparison studies for catigorial data; studies that compare two alternative categorization schemes. Categorial data can be formed in a contingency table where the row levels represent the ratings of one observer (method) and the column levels the ratings of onother observer (method). Suppose p ij is the probability of a target being classified in the ith category by n the first observer and the jth category by the second observer. Then p 0 = is p ii i = 1 the probability that the observers agree. If the ratings were independent, the n probability of agreement would be p e = p p. Thus p 0 - p e is the amount of i= 1 i.. i agreement beyond that expected by chance. The κ-measure (κ coefficient) is defined as: κ = p p 0 e. The κ - measure of agreement can be interpreted as 1 pe the chance corrected proportional agreement. A weakness of the κ is that it takes no account of the degree of disagreement i.e. all disagreements are treated (weighted) equally. When the categories are ordered (ordinal scaled data), it is sometimes preferable to put different weights to disagreements according to the magnitude of discrepancy. This idea leads to the so called weighted κ. Weighted κ is obtained by weighting the frequencies in each cell of the table according to their distance from the diagonal that indicates agreement. For a detailed discussion see e.g. Fleiss (1981). PROC FREQ within the SAS System can be used to compute the κ measures. Example Data show the classification by two radiologists of 85 xeromammograms. 0 corresponds to normal, 1 to benign diseas, 2 to suspicion of cancer and 3 to cancer.

/* For n*n Tables */ /* Simple and Weighted kappa */ /* Altman p.403 Bsp. Tab.14.2 */ data k1; input x y weight; cards; 0 0 21 0 1 12 0 2 0 0 3 0 1 0 4 1 1 17 1 2 1 1 3 0 2 0 3 2 1 9 2 2 15 2 3 2 3 0 0 3 1 0 3 2 0 3 3 1 ; proc freq data=k1; tables x*y / agree; weight weight; This results in the output: TABLE OF X BY Y X Y Frequency Percent Row Pct Col Pct 0 1 2 3 Total ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 0 21 12 0 0 33 24.71 14.12 0.00 0.00 38.82 63.64 36.36 0.00 0.00 75.00 31.58 0.00 0.00 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 1 4 17 1 0 22 4.71 20.00 1.18 0.00 25.88 18.18 77.27 4.55 0.00 14.29 44.74 6.25 0.00 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 2 3 9 15 2 29 3.53 10.59 17.65 2.35 34.12 10.34 31.03 51.72 6.90 10.71 23.68 93.75 66.67 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 3 0 0 0 1 1 0.00 0.00 0.00 1.18 1.18 0.00 0.00 0.00 100.00 0.00 0.00 0.00 33.33 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 28 38 16 3 85 32.94 44.71 18.82 3.53 100.00 STATISTICS FOR TABLE OF X BY Y Test of Symmetry ---------------- Statistic = 15.400 DF = 6 Prob = 0.017 Kappa Coefficients Statistic Value ASE 95% Confidence Bounds ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Simple Kappa 0.473 0.073 0.330 0.615 Weighted Kappa 0.568 0.068 0.436 0.701 Sample Size = 85 The following IML code shows an alternative approach for computing the κ - measure including estimates of the standard error and 95% confidence limits. Comparing the IML approach to PROC FREQ it provides more flexibility in specifying weights. Another advantage is that the IML program is not restricted to n*n tables. If it seems to make sense r*s tables can be used. Restrictions for data entry are that the values of the score variables have to range from 0 to n and the dimension of the smallest squared table in which the resulting r*s table fits has to be specified. A detailed description of the mathematical formulas used in the IML statements is given in Fleiss (1981). /* Not only for n*n Tables */ /* Weighted kappa */ /* Altman S.403 Example Tab.14.2 */ %let dim=4; /* weights for simple kappa*/ %let w=%str( {1 0 0 0, 0 1 0 0, 0 0 1 0, 0 0 0 1} ); data index; do ind1=1 to &dim; do ind2=1 to &dim; output; end; end; data k1; input x y weight; cards; 0 0 21 0 1 12 0 2 0 0 3 0 1 0 4 1 1 17 1 2 1 1 3 0 2 0 3 2 1 9 2 2 15 2 3 2 3 0 0 3 1 0 3 2 0 3 3 1 ; data k; set k1; drop weight i; do i=1 to weight; output; end; proc freq data=k; This results in the output: tables y*x / out=m1 noprint; proc sort data=m1; by x y; data m1; set m1; ind1=x+1; ind2=y+1; data m2; merge index m1; by ind1 ind2; data m2; keep ind1 ind2 count; set m2; if count=. then count=0; proc transpose data=m2 out=mat prefix=c; var count; id ind2; by ind1; data mat; keep c1-c&dim; set mat; /*** IML ***/ proc iml; w=&w; use mat; read all into x; n=sum(x); x1=x#w; po=(1/n)*sum(x1); csum=x[+,]; rsum=x[,+]; KAPPA 0.4727891 y=rsum*csum; y_=csum*rsum; y=(1/n)*y; p=(1/n)*y; p_=(1/n)*(1/n)*y_; pcsum=p[+,]; prsum=p[,+]; wi=w#prsum; wj=w#pcsum; wqi=wj[,+]; wqj=wi[+,]; wij=j(&dim,&dim); do i=1 to &dim; do j=1 to &dim; wij[i,j]=wqi[i,]+wqj[,j]; end; end; wq=(w-wij)##2; pw=p#wq; pws=sum(pw); y1=y#w; pe=(1/n)*sum(y1); pke=sum(p_); kappa=(po-pe)/(1-pe); print kappa; var0=(1/(n*(1-pke)*(1- pke)))*(pws-pke*pke); std=sqrt(var0); cl_l=kappa-1.96*std; cl_u=kappa+1.96*std; print var0 std; print cl_l cl_u; quit;

VAR0 STD 0.0048129 0.0693751 CL_L CL_U 0.3368139 0.6087643 SAS, Base SAS, SAS/STAT, SAS/GRAPH and SAS/IML software are registered trademarks of SAS Institute Inc., Cary, NC, USA Address for correspondence: Dipl.Ing. Dr. Barbara Schneider Institut für Medizinische Statistik Universität Wien Schwarzspanierstr. 17 A - 1090 Wien Austria

REFERENCES: Bland J.M., Altman D. G. (1986) Statistical Methods for Assessing Agreement between two Methods of Clinical Measurement, THE LANCET, Feb. 8, 1986 Altman D. G. (1991) Practical Statistics for Medical Research, 1st edn. Chapman&Hall Fleiss J. L. et al. (1979), Inter-examiner Reliability in Caries Trials, J Dent Res 58(29:604-609, February 1979 Fleiss J. L. (1981) Statistical Methods for Rates and Proportions, 2nd edn. New York: Wiley Shrout P. E., Fleiss J. L. (1979) Intraclass Correlations: Uses in Assessing Rater Reliability,Psychological Bulletin 1979, Vol. 86, No. 2, 420-428 SAS Language: Reference, Version 6, First Edition SAS Procedures Guide, Version 6, Third Edition SAS/GRAPH Software: Reference, Version 6, First Edition, Volumes 1 and 2 SAS/IML Software: Usage and Reference, Version 6, First Edition SAS/STAT Software: Reference, Version 6, First Edition, Volumes 1 and 2