The assessment of interpretation of test results in laboratory medicine
Gordon S. Challand
The total laboratory testing cycle, formerly known as Lundberg’s ‘brain-to-brain loop’, describes all steps from ordering the right test to correctly interpreting and acting on test results for effective clinical decision-making (1,2). Forty years ago, the majority of laboratory requests came from hospital-based clinical staff who had received formal training in Pathology and Laboratory Medicine at medical school. Except in difficult or highly specialised cases, there was little need for laboratory staff to offer advice on the interpretation of results. Since then, in many countries the amount of basic science taught at medical undergraduate level has declined (3), the complexity of laboratory diagnostics has increased dramatically, and many clinical staff request tests with which they are unfamiliar. There is therefore a growing need for interpretative advice from the laboratory, both to improve the appropriateness of test requesting and to support effective interpretation of test results (1,4). Ideally, such advice would be provided through telephone conversations or visits to wards, but this is hampered by the size of present-day workloads and by the fact that the majority of laboratory requests may come from primary care physicians. In the UK, the majority of laboratories use a Duty Biochemist (a senior member of medical or scientific staff holding a professional qualification) to offer interpretative advice on Clinical Biochemistry results, usually by attaching a brief comment to a report. Although this practice is itself controversial (5), surveys of clinical staff indicate both that such comments are welcomed and appreciated (6) and that they influence patient management (7). Even in those laboratories where such a service is not provided, an interpretative comment is usually available on direct enquiry by a clinician (8).
Clinical Pathology Accreditation (UK) Standards state that interpretation of results is an important component of the service provided by clinical laboratories (9). However, there is little consensus on how often such comments are required or on what users need from them, let alone on the appropriateness or otherwise of specific comments in a given situation. Ten years ago the Duty Biochemist often worked in isolation with little or no feedback from users of the service or from colleagues, and it is surprising how little evidence-based data there is for the interpretation of even common laboratory abnormalities.
In 1997 an interpretative exercise ‘Cases for Comment’ began to be distributed through the general discussion mailbase of the ACB (acb-clin-chem-gen@jiscmail.ac.uk) (10); in 2001 this was developed into a formal EQAS for interpretation in Clinical Biochemistry (11). In 2000 a similar scheme was piloted in Australia but without any assessment of the appropriateness of comments, as it was felt that participants would get a sense of what was an appropriate comment by peer group comparison (12). This turned out not to be the case: at the end of the pilot survey, both participants and organisers felt the need for formal assessment and feedback on the appropriateness of the comments (13). The form of assessment used by each scheme has been different and each has advantages and disadvantages (14).
Methods of the survey
Cases for Comment (10)
The organiser broke down comments into components, each of which was then allotted a numerical score by a peer review group, all of whose members held professional qualifications and worked in isolation from each other. The mean score for each component then gave an assessment of its comparative appropriateness. An example of a case summary produced using this approach is shown in Table 1. Distributed by e-mail, this was the least formal of the approaches. One hundred Cases were distributed over the four years this Scheme operated.
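The arithmetic behind component marking can be sketched in a few lines of Python. This is a minimal illustration rather than the scheme's actual procedure or software: the function name, the data layout and the individual assessor scores are hypothetical, and a small signed integer scale is assumed because the scale itself is not stated here.

```python
from statistics import mean

def score_components(assessor_scores):
    """Return the mean assessor score for each comment component.

    assessor_scores maps a component (a phrase extracted from participants'
    comments) to the scores given by assessors working independently.
    """
    return {component: round(mean(scores), 1)
            for component, scores in assessor_scores.items()}

# Hypothetical scores from four independent assessors (illustrative only,
# not data from the scheme).
example = {
    "repeat HCG after 1 week": [2, 1, 2, 1],
    "refer for ultrasound": [-1, 0, -1, 0],
}

component_means = score_components(example)

# List components from most to least appropriate, in the bracketed style
# used in the case summaries.
for component, value in sorted(component_means.items(),
                               key=lambda item: item[1], reverse=True):
    print(f"{component} [{value}]")
```

Because each assessor scores in isolation, the mean is only a comparative guide: a component with a mean near zero may reflect general indifference or a sharp split of opinion, and the two cannot be told apart from the mean alone.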
The UK NEQAS for Interpretative Comments (11,16)
This scheme distributes Cases and Summaries through a web page and assesses whole comments rather than individual components, which avoids the subjective element inherent in breaking down comments into components or key phrases. Like ‘Cases for Comment’, it uses independent peer review to allot a numerical score to each comment. The mean score enables ranking of all comments in terms of their appropriateness, and examples of low, median and high scoring comments are included in the summary. An example of this approach is shown in Table 2. At least 20 Cases are distributed each year.
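Ranking whole comments by mean score, and picking low, median and high scoring examples for a summary, might look like the following sketch. The function names, comment labels and scores are hypothetical, and taking the middle entry of the ranked list as the "median" comment is an assumption of this sketch, not a documented rule of the scheme.

```python
from statistics import mean

def rank_comments(scores_by_comment):
    """Rank whole comments by mean assessor score, highest first."""
    means = {text: mean(scores) for text, scores in scores_by_comment.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

def pick_examples(ranked):
    """Return the highest, middle and lowest ranked comments for a summary."""
    return {
        "high": ranked[0],
        "median": ranked[len(ranked) // 2],
        "low": ranked[-1],
    }

# Hypothetical comments, each scored by three independent assessors
# (illustrative only, not data from the scheme).
scores = {
    "comment A": [2, 2, 2],
    "comment B": [1, 0, 1],
    "comment C": [-1, 0, -2],
    "comment D": [0, 0, 1],
    "comment E": [2, 1, 2],
}

ranked = rank_comments(scores)
examples = pick_examples(ranked)
for label, (text, score) in examples.items():
    print(f"{label}: {text} (mean score {score:.2f})")
```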
The Australian RCPA-QAP for Interpretative Comments (12,13)
The organiser breaks down comments into key phrases, which are assessed by an expert panel of peers holding specified offices in professional bodies; the panel members work together to classify each phrase as preferred, of lesser value, or inappropriate. A suggested comment is also made by the expert panel (12). Distributed by conventional mail, it is the most formal of the approaches. In essence, this is similar to the original ‘Cases for Comment’ in its breakdown of comments into components or key phrases, but it differs in that a consensus verbal view is reached through discussion within the expert panel, rather than a mean numerical view reached by assessors working independently. An advantage is that the expert panel consensus view is more likely to be accepted by participants than the mean view obtained from independent assessors. A limitation is that the need for expert group discussion and the distribution by conventional mail restrict the practicable case distribution to around 10 each year.
Results of the survey
Although in general all approaches to assessment give broadly similar results, there is not a particularly good correlation between the assessment schemes (Vasikaran S D, unpublished observation) and on occasions there is a profound difference both between assessors and between members of an expert panel on the value of a comment, whether it is assessed in its entirety or broken down into components (12). There is some evidence from the UK NEQAS that assessors find it easier to agree on what might be called good or bad comments than on intermediate comments; and there may also be differences between what could be called a Teaching Hospital (or specialised centre) approach and a General Hospital (or non-specialised centre) approach, particularly in the extent to which follow-up tests are suggested or carried out (Challand G S, unpublished observations).
Table 1. A ‘Cases for Comment’ summary: comment component marking by independent assessors
A 22 year old girl seeing her family doctor. Clinical information is ‘Termination of pregnancy at end of August. Unprotected intercourse September 20th, followed by ‘morning-after’ pill and a period. Unprotected intercourse October 26th. Positive home pregnancy test. Status?’
Serum HCG was 240 U/L
This Case attracted 46 participants, 8 of whom would phone to discuss the Case.
10 participants commented ‘pregnant’ [0.0]; 2 commented ‘early pregnancy’ [0.8]; 5 commented ‘very early pregnancy’ [1.0]; 1 said ‘pregnant 1–2 weeks’ [-0.3]; 1 said ‘pregnant 3–4 weeks’ [0.5]; 5 said ‘pregnant 4–5 weeks’ [1.0].
2 said ‘consistent with 26 October event’ [0.3]; 2 said ‘possibly consistent with 20 September event’ [0.3]; 6 said the result is too high to be related to the October 26 event [1.0]; 6 said pregnant subsequent to the September 20 event [1.3].
5 said the result could be consistent with incomplete termination of pregnancy [0.3]; 2 said that this could be residual HCG from the earlier termination [0.8]; 2 mentioned the possibility of choriocarcinoma [0.3]; 2 mentioned the possibility of an ectopic pregnancy [0.8].
12 would suggest a repeat HCG after 2 days [1.5]; 8 would suggest a repeat after 1 week [1.5]; 1 after 2 weeks [0.0]; 1 after 3 weeks [-0.5].
2 would ask for a urine sample for a pregnancy test [-0.5]; 1 would suggest a progesterone with the repeat HCG [-1.0]; 9 would suggest referral for ultrasound [-0.5].
The Scheme Organiser’s comment was ‘Average for day 30 pregnancy, too low for first event and likely to be too high for second. Could be residual HCG from earlier termination, but pregnancy arising from second event cannot be excluded. Suggest repeat after 1 week’.
Table 2. A case summary from the UK NEQAS scheme: whole comment marking by independent assessors
A 23 year old woman was admitted to the Emergency Surgical Unit of the hospital with abdominal pain. The clinical details on the initial request form were ‘RUQ pain’. Serum amylase was within reference limits.
The following day she was transferred to a surgical ward, and a further sample was received, with clinical information ‘Abdominal pain. Tubal ectopic?’.
Serum urea and electrolytes were normal. Serum HCG was 345,000 U/L.
This Case attracted 143 participants, 140 achieving a positive score (median and interquartile range = 1.22 (0.82-1.50)). Nearly all the participants commented that this high HCG level suggested that an ectopic pregnancy was unlikely. Most participants queried if this was caused by a multiple pregnancy or an HCG secreting tumour; half of these comments mentioned the need for an ultrasound scan.
It is interesting that a few DPC Immulite users felt that this HCG was normal for pregnancy. Although DPC quote a median HCG of 118,000 U/L for gestation of 7-11 weeks, their quoted range is up to 291,000 U/L in singleton pregnancies. It is unusual from our experience with the Bayer Centaur to see HCG levels of more than 180,000 U/L in singleton pregnancies.
This lady was transferred to the care of the Obstetricians. Her abdominal pain had resolved and liver function tests were normal. An ultrasound scan later that day showed a twin pregnancy of 10 weeks gestation. She was discharged feeling well the following day.
A low scoring comment was
Results consistent with pregnancy. Repeat amylase to follow on this sample – would expect an increase if a tubal pregnancy has ruptured.
A median scoring comment was
HCG level too high for ectopic, although possible. Trophoblastic tumour should be excluded.
A high scoring comment was
HCG is too high for ectopic pregnancy; in fact it is very high for a normal pregnancy & a twin/ molar pregnancy should be excluded. Suggest scan.
There is no ideal method for assessing interpretative comments. The outcome of a Case should not be used, since at the time advice is offered the outcome is unknown, may never be known, and may prove to be an unusual cause of a common set of abnormalities. The consensus view (analogous to the method mean in analytical QA) cannot be used since a consensus seldom occurs (15,16), and even when it does there may be wide differences of opinion on appropriate follow-up investigations.
Despite being superficially simple, interpretative comments are complex and typically contain at least three distinct ideas (for example, suggesting a probable diagnosis; suggesting which diagnoses can be excluded; and suggesting additional investigations), each of which may be regarded as appropriate or inappropriate. Moreover, a comment requires communication skills. Either the whole comment or its individual components can be assessed, and the assessment can be carried out by an EQAS Organiser, by a group of experts, or by a group of peers, with the groups working either together to arrive at a consensus view or independently to provide an arithmetic mean view. An advantage of assessing components is that many of these are common to most participants, so that almost all participants can feel that part of their comment was good. A disadvantage is that the breakdown of each comment into components is rather subjective. An advantage of assessing whole comments is that this avoids that subjective (personally biased) step, and that it tests communication skills as well as interpretative ideas. A disadvantage is that it is undoubtedly more challenging for both the participants and the assessors. We also have to accept that any assessment process inevitably includes an element of ranking each single comment against the others: there are seldom definitive guidelines which can be used as a basis for assessment.
ISO 15189 clause 5.6.4 states “External quality assessment programmes should, as far as possible, provide clinically relevant challenges that mimic patient samples and that check the entire examination process including pre- and post-examination procedures” (17). Interpretation of results offered by laboratories, a post-examination procedure, is thus included. Assessment is widely used in many academic and professional activities, but others have reported on its lack of reproducibility (18). A recent article identified both intra-assessor and inter-assessor differences, and suggested that a minimum of 16 assessors was needed to arrive at a consistent (but not necessarily correct) verdict in 95% of cases (19). However, this extrapolation depends on an implicit assumption that the sole reason for intra- and inter-assessor differences is random variation, akin to intra- and inter-assay imprecision in conventional analytical EQA. This assumption is invalid, since non-random differences between assessors do exist and lead to different outcomes. Two kinds can be distinguished. The first is that different assessors may put different weight on a single inappropriate component in an otherwise appropriate comment, or on communication skills, and may use different and contradictory guidelines for assessing the appropriateness of a comment (akin to calibration differences between different assays). The second is more fundamental: different assessors may come to different conclusions based upon their individual past experiences, which may differ widely. This is akin to different immunoassays attempting to quantify a complex mixture of fragments and intact molecules using different antibodies recognising different epitopes or combinations of epitopes. Where non-random variables influence different assessors, it is impossible to define how many assessors are needed to achieve the same majority (but not necessarily the correct) verdict in 95% of cases.
Among other things, this depends on the relative proportions of assessors holding one view as opposed to another. For example, if half the assessors believe the answer to a question is ‘yes’ and the other half believe it is ‘no’, there is (give or take random variation) never a majority verdict. The mean answer is ‘maybe’, which satisfies neither half, but may be the correct verdict.
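The dependence on the proportion of assessors holding each view can be illustrated with a simple binomial calculation. This is a deliberately simplified sketch and not the method used in the cited study: it assumes that each assessor independently answers ‘yes’ with probability p (equivalently, that assessors are drawn at random from a population in which a fraction p holds that view), and it considers only odd panel sizes so that a tied vote cannot occur. The function names and the 0.95 target are choices made for this illustration.

```python
from math import comb

def majority_probability(n, p):
    """Probability that more than half of n independent assessors say 'yes',
    when each says 'yes' with probability p (binomial model)."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

def smallest_panel(p, target=0.95, max_n=201):
    """Smallest odd panel size whose majority verdict is 'yes' with
    probability at least `target`, or None if no size up to max_n does."""
    for n in range(1, max_n + 1, 2):
        if majority_probability(n, p) >= target:
            return n
    return None

print(smallest_panel(0.9))  # assessors largely agree
print(smallest_panel(0.5))  # assessors evenly split
```

Under these assumptions a panel of 3 is enough when 90% of assessors hold the same view, whereas with an even split the majority verdict is ‘yes’ only half the time whatever the panel size, so no number of assessors reaches the 95% target.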
This would be of major concern if the majority assessment verdict were used as the ‘gold standard’ (which it often is in many areas of academic and professional activity). For interpretation, an assessment verdict can only be regarded as a guide as to which comments may be viewed as more or less appropriate. In the current schemes, there are several ‘over-rides’ to prevent the highest scoring comment or key phrase being necessarily regarded as the gold standard. The first is to use the actual outcome of the Case when this is known. The second is to use the opinion of the scheme organiser, reinforced if necessary by seeking expert opinion. Neither of these over-rides is in itself a gold standard: the outcome of a Case cannot be regarded as such since it may have been a rare cause of a common set of abnormalities; and experience from the UK NEQAS is that experts sometimes do not realise that they hold extreme minority (but not necessarily incorrect) opinions. We have to accept that assessment of any sort is imperfect; that the primary purpose of interpretative EQAS is educational (20); and that assessment of any kind is a guide to participants, not a definition of exact solutions which seldom if ever exist. Although assessment is imperfect, it is an essential component of any EQAS assessing interpretation, and there is evidence that assessment procedures incorporated in an EQA scheme improve the standard of interpretation (20).
Potential conflict of interest
1. Lippi G, Fostini R, Guidi GC. Quality improvement in laboratory medicine: extra-analytical issues. Clin Lab Med 2008;28:285-94.
2. Plebani M. Laboratory errors: How to improve pre- and post-analytical phases? Biochemia Medica 2007;17:5-9.
3. Freedman DB. Is the medical undergraduate curriculum ‘fit for purpose’? Ann Clin Biochem 2008;45:1-2.
4. Simundic AM, Topic E. Quality indicators. Biochemia Medica 2008;18:311-9.
5. Kilpatrick ES, Barth JH. Whither clinical validation? Ann Clin Biochem 2006;43:171-2.
6. Barlow IM. Are biochemistry interpretative comments helpful? Results of a general practitioner and nurse practitioner survey. Ann Clin Biochem 2008;45:88-90.
7. Barlow IM. Do interpretative comments influence patient management and do our users approve of the laboratory ‘adding on’ requests? A follow-up General Practitioner and Nurse Practitioner survey. Ann Clin Biochem 2009;46:65-86.
8. Sciacovelli L, Zardo L, Secchiero S, Zaninotto M, Plebani M. Interpretative comments and reference ranges in EQA programs as a tool for improving laboratory appropriateness and effectiveness. Clin Chim Acta 2003;333:209-19.
9. Clinical Pathology Accreditation (UK). Standards for the medical laboratory.
10. Challand G. Cases for comment, education and audit. JIFCC 1998;10:53-5.
11. Challand GS, Osypiw JC, MacKenzie F, Middle JG, Fister D, Stezhka L. Web-based external quality assessment of individual skills – the UK NEQAS for interpretative comments in Clinical Chemistry. EuroMedLab 2001, Prague, abstract S259.
12. Vasikaran SD, Penberthy L, Gill J, Scott S, Sikaris KA. Review of a pilot Quality Assessment Program for interpretative commenting. Ann Clin Biochem 2002;39:261-72.
13. Lim EM, Sikaris KA, Gill J, Calleja J, Hickman PE, Beilby J, Vasikaran SD. Quality assessment of interpretative commenting in Clinical Chemistry. Clin Chem 2004;50:632-7.
14. Challand GS, Vasikaran SD. The assessment of interpretation in Clinical Biochemistry: a personal view. Ann Clin Biochem 2007;44:101-5.
15. Challand GS. Assessing the quality of comments on reports: a retrospective study. Ann Clin Biochem 1999;36:316-32.
16. Li P, Challand GS. Experience with assessing the quality of comments on Clinical Biochemistry reports. Ann Clin Biochem 1999;36:759-65.
18. Goldman RL. The reliability of peer assessments: a meta-analysis. Eval Health Prof 1994;17(1):3-21.
19. Bindels R, Hasman A, van Wersch JWJ, Pop P, Winkens RAG. The reliability of assessing the appropriateness of requested diagnostic tests. Med Decis Making 2003;23:31-7.
20. Osypiw JC, Challand GS. ‘Cases for Comment’ on the ACB mailbase: were they educational? Proceedings of the ACB National Meeting 2002: A105: p85.