**Importance of measuring interrater reliability**

When more than one person collects research data, the question of *agreement* among the individuals collecting data immediately arises due to the variability among human observers. Well-designed research studies must therefore include procedures that measure agreement among the various data collectors. Study designs typically involve training the data collectors and measuring the extent to which they record the same scores for the same phenomena. Perfect agreement is seldom achieved, and confidence in study results is partly a function of the amount of disagreement, or *error*, introduced into the study by inconsistency among the data collectors. The extent of agreement among data collectors is called "*interrater reliability*".

**Theoretical issues in measurement of rater reliability**

A distinction must be made between reliability across multiple data collectors, termed *interrater* reliability, and reliability of a single data collector, which is termed *intrarater* reliability. With a single data collector the question is this: presented with exactly the same situation and phenomenon, will an individual interpret the data the same way and record exactly the same value for the variable each time these data are collected? Intuitively it might seem that one person would behave the same way with respect to exactly the same phenomenon every time the data collector observes it. However, research demonstrates the fallacy of that assumption. One recent study of intrarater reliability in evaluating bone density X-rays produced reliability coefficients as low as 0.15 and as high as 0.90 (4). Clearly, researchers are right to consider carefully the reliability of data collection as part of their concern for accurate research results.

The difficulty of obtaining reliability varies with the judgment required. Some data demand little of the collector, for example recording whether a patient *survived* or *did not survive*; there are unlikely to be significant problems with reliability in collecting such data. On the other hand, when data collectors are required to make finer discriminations, such as judging the intensity of redness surrounding a wound, reliability is much more difficult to obtain. In such cases the researcher is responsible for careful training of the data collectors and for testing the extent to which they agree in their scoring of the variables of interest.

**Measurement of interrater reliability**

*Percent agreement*

The value *1.00 − percent agreement* may be understood as the proportion of the data that are incorrect. That is, if percent agreement is 82%, then 1.00 − 0.82 = 0.18, and 18% of the data misrepresent the research data.
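As a minimal sketch, percent agreement between two raters on nominal data can be computed as the fraction of cases scored identically (the rating lists below are hypothetical):

```python
def percent_agreement(rater_a, rater_b):
    """Fraction of cases on which the two raters assign the same score."""
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must score the same cases")
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical ratings of ten cases by two data collectors.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]

agreement = percent_agreement(a, b)
print(f"percent agreement: {agreement:.0%}")  # 9 of 10 cases match -> 90%
```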

**Table 1. Calculation of percent agreement (fictitious data).**

**Table 2. Percent agreement across multiple data collectors (fictitious data).**

*Cohen’s kappa*

Given the consequences of *disagreement* among the raters, the interpretations in Table 3 can be simplified as follows: any kappa below 0.60 indicates inadequate agreement among the raters, and little confidence should be placed in the study results. Figure 1 displays the concept of research datasets as consisting of both correct and incorrect data. Kappa values below zero are unlikely to occur in research data, but when this outcome does occur it is an indicator of a serious problem. A negative kappa represents agreement *worse* than expected, or disagreement. Low negative values (0 to −0.10) may generally be interpreted as "no agreement", and a large negative kappa represents great disagreement among raters. Data collected under conditions of such disagreement are not meaningful; they are more like random data than properly collected research data or quality clinical laboratory readings, and are unlikely to represent the facts of the situation (whether research or clinical data) with any meaningful degree of accuracy. Such a finding requires action either to retrain the raters or to redesign the instruments.

**Table 3. Interpretation of Cohen’s kappa.**
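As a sketch of the statistic itself, Cohen's kappa for two raters is κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's marginal totals (the rating lists below are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected
    for the chance agreement implied by the raters' marginals."""
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)
              for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings: 90% raw agreement, but kappa discounts chance.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]

print(round(cohens_kappa(a, b), 3))  # 0.8
```

Note how kappa (0.80) is lower than raw percent agreement (90%) on the same data, which is exactly the divergence the text discusses.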

**Figure 1. Components of data in a research data set.**

The *coefficient of determination* (COD) is directly interpretable: it is the amount of variation in the dependent variable that can be explained by the independent variable. While the true COD is calculated only on the Pearson r, an estimate of the variance accounted for can be obtained for any correlation statistic by squaring the correlation value. By extension, squaring the kappa translates conceptually to the amount of accuracy (i.e. the reverse of error) in the data due to congruence among the data collectors. Figure 2 displays an estimate of the amount of correct and incorrect data in research data sets by the level of congruence, as measured by either percent agreement or squared kappa.

**Figure 2.** Graphical representation of amount of correct data by % agreement or squared kappa value.
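As a minimal illustration of the squared-kappa heuristic just described (the kappa value here is hypothetical):

```python
# Squaring kappa gives a rough estimate of the proportion of data that
# are accurate, analogous to squaring a correlation coefficient.
kappa = 0.80  # hypothetical kappa from a rater-agreement study
estimated_accurate = kappa ** 2
print(f"estimated accurate data: {estimated_accurate:.0%}")  # 64%
```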

^{1} represents column 1 marginal,
^{2} represents column 2 marginal,
^{1} represents row 1 marginal, and
^{2} represents row 2 marginal.

**Figure 3. Data for kappa calculation example.**

One study comparing manual *versus* automated raters obtained only a moderate kappa (κ = 0.555), yet the same data produced an excellent percent agreement of 94.2%. The problem in interpreting these two statistics is this: how shall researchers decide whether the raters are reliable? Are the obtained results indicative of the great majority of patients receiving accurate laboratory results, and thus correct medical diagnoses, or not? In the same study, the researchers selected one data collector as the standard and compared five other technicians’ results with the standard. While data sufficient to calculate a percent agreement are not provided in the paper, the kappa results were only moderate. How shall the laboratory director know whether the results represent good-quality readings with only a small amount of disagreement among the trained laboratory technicians, or whether a serious problem exists and further training is needed? Unfortunately, the kappa statistic alone does not provide enough information to make such a decision. Furthermore, a kappa may have such a wide confidence interval (CI) that it includes anything from good to poor agreement.

*Confidence intervals for kappa*

Computing a CI requires that the standard error of kappa (SE_{κ}) is multiplied by 1.96, the z-value for a 95% interval. The formula for the confidence interval is:

**κ − 1.96 × SE_{κ} to κ + 1.96 × SE_{κ}**

To estimate the standard error of kappa (SE_{κ}), the following large-sample formula should be used, where p_{o} is the observed agreement, p_{e} the expected agreement, and N the number of observations:

**SE_{κ} = √[p_{o}(1 − p_{o}) / (N(1 − p_{e})²)]**

In the example, κ = 0.85, SE_{κ} = 0.037, p_{e} = 0.57, and N = 222. The 95% CI is therefore 0.85 − 1.96 × 0.037 to 0.85 + 1.96 × 0.037, which calculates to an interval of 0.77748 to 0.92252 and rounds to a confidence interval of 0.78 to 0.92. It should be noted that SE_{κ} is partially dependent upon sample size: the larger the number of observations measured, the smaller the expected standard error. While the kappa can be calculated for fairly small sample sizes (e.g. 5), the CI for such studies is likely to be so wide that "no agreement" falls within it. As a general heuristic, sample sizes should not consist of fewer than 30 comparisons. Sample sizes of 1,000 or more are mathematically most likely to produce very small CIs, which means the estimate of agreement is likely to be very precise.
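The CI arithmetic above can be sketched as follows. The standard-error formula is the common large-sample approximation SE_{κ} = √[p_o(1 − p_o)/(N(1 − p_e)²)], and p_o = 0.94 is an assumed value chosen to reproduce the example's SE_{κ} ≈ 0.037:

```python
import math

def kappa_confidence_interval(kappa, p_o, p_e, n, z=1.96):
    """CI for kappa: kappa +/- z * SE_kappa, using the common
    large-sample approximation for the standard error."""
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa - z * se, kappa + z * se

# Values approximating the worked example in the text (p_o is assumed).
low, high = kappa_confidence_interval(kappa=0.85, p_o=0.94, p_e=0.57, n=222)
print(f"95% CI: {low:.2f} to {high:.2f}")  # 95% CI: 0.78 to 0.92
```

Doubling n in this sketch shrinks the SE by a factor of √2, which is the sample-size effect the text describes.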

**Figure 4. Calculation of the kappa statistic.**