Description of the task
Reference intervals (1, 2) are amongst the essential tools for interpreting laboratory results. Excellent theoretical and practical studies have been published since the 1960s, establishing reference intervals as a concept in contrast to “normal intervals” (3), establishing principles for selecting reference persons (4) and describing the appropriate statistical data analysis (5-8). Reference intervals continue to be an active area of research (2, 9) and remain at the core of practical work in clinical laboratories (6, 10).
The reference interval includes the central 95% of the results/values measured in a representative sample of reference subjects and is bounded by lower and upper reference limits marking the 2.5th and 97.5th percentiles, respectively (Figure 1). The reference interval, including the upper and lower reference limits, can be read directly from a ranked set of data values if the number of observations is large, e.g. > 1000. When the number of values is small, e.g. in the order of 40-120, and the data are non-Gaussian, resampling methods are useful for estimating the reference interval. The reference population is commonly a population of apparently healthy individuals, but may also consist of any other well-defined population of interest for diagnostic/comparative purposes.
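In symbols, the lower reference limit L and the upper reference limit U of a quantity x are defined by (a standard formalization of the above, not quoted from the cited guidelines):

\[ P(x < L) = 0.025, \qquad P(x > U) = 0.025, \qquad P(L \le x \le U) = 0.95 \]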
Different methods for calculating reference intervals
Estimating reference intervals means dealing with uncertainties and probabilities. All probabilistic methods are based on assumptions about a theoretical distribution, and these assumptions fundamentally determine the conclusions that can be drawn from the data. The most commonly used probabilistic methods are the parametric methods, which assume that observations in the population are distributed according to the Gaussian/Normal distribution. These are the methods of choice if the data are Gaussian or can be transformed to that distribution: the data themselves, combined with knowledge of their distribution in the sample and the population, allow the user to draw firmer conclusions than the data alone would.
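For Gaussian data, or data successfully transformed to the Gaussian distribution, the reference limits follow directly from the sample mean and standard deviation (with the limits back-transformed to the original scale if a transformation was used):

\[ L = \bar{x} - 1.96\,s, \qquad U = \bar{x} + 1.96\,s \]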
When the data are non-Gaussian and cannot be transformed to the Gaussian distribution, the analyst is left with non-parametric or resampling methods for determining reference intervals. Data from at least 120 reference individuals are needed for reliably determining non-parametric reference intervals (7, 11). Resampling/bootstrap methods (12) take numerous repeated subsamples, with replacement, of the available data in order to estimate the distribution of the data in the population, including the reference interval.
Resampling methods with replacement are commonly called bootstrap methods, a name believed by many to refer to the tale of Baron Munchausen pulling himself and his horse out of a swamp by his own hair. Bootstraps are in fact the tabs, loops or handles at the top of boots that allow the fingers to be used to pull the boots on. The bootstrap is therefore a metaphor for clever, self-reliant salvaging efforts. There is a wealth of current introductory literature on resampling methods in general (13) and on resampling methods for Microsoft Excel® in particular (14).
Resampling with replacement means that a large number of samples with replacement are taken from the original sample and the statistic of interest is calculated from this pseudo-population as an estimate of the corresponding parameter of the population (Figure 2).
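As a minimal illustration of the principle (a sketch in Python rather than Excel, with simulated data standing in for a real reference sample), a single resample of the same size as the original sample is drawn with replacement and its percentiles computed:

import numpy as np

rng = np.random.default_rng(1)           # reproducible random numbers
sample = rng.normal(100, 15, size=83)    # simulated stand-in for 83 reference values

# one pseudo-sample: 83 values drawn with replacement from the original sample
resample = rng.choice(sample, size=sample.size, replace=True)

# percentile estimates from this single pseudo-sample
print(np.percentile(resample, [2.5, 97.5]))

Repeating the draw many times and pooling the percentile estimates is what turns this single step into a bootstrap estimate of the reference interval.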
Percentiles
A proper start when calculating reference intervals is to estimate the observation(s) corresponding to 2.5% and 97.5% in a ranked list of reference values. Both the International Federation of Clinical Chemistry (IFCC) (5-7, 15) and the Clinical and Laboratory Standards Institute (CLSI) (16) recommend the following formulas: the lower limit has the rank number 0.025 x (n+1) and the upper limit the rank number 0.975 x (n+1). With n = 120 reference values, for example, the lower limit lies at rank 0.025 x 121 = 3.025, i.e. essentially at the third lowest value. This method is practicable mainly when the number of reference samples is 120 or more.
While there is agreement on how to calculate the median, agreement on how to calculate quantiles, including percentiles, is lacking because interpolation is frequently needed, especially when the number of observations is small. Unfortunately there is no general agreement on the best way of performing these interpolations. In 1996 Hyndman and Fan published an influential paper on calculating quantiles (17). It evaluated the methods used by popular statistics packages with the intention of finding a consensus and a basis for standardization. Of the nine formulas commonly used, four satisfied five of the six properties desirable for a sample quantile. Hyndman and Fan favoured the “linear interpolation of the approximate medians for order statistics” because it gives approximately median-unbiased estimates of the quantiles regardless of the distribution (17); the Excel =PERCENTILE.EXC function implements the closely related k/(n+1) interpolation, which corresponds to the rank formula recommended by the IFCC and CLSI.
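The practical consequences of the choice of interpolation method are easy to demonstrate. The following sketch assumes NumPy 1.22 or later, where the Hyndman-Fan definitions are exposed through the method argument of numpy.quantile; the simulated data are illustrative only:

import numpy as np

rng = np.random.default_rng(7)
values = rng.lognormal(mean=4.6, sigma=0.3, size=60)   # skewed toy data

# "weibull" is Hyndman-Fan definition 6, the k/(n+1) rule used by Excel's
# PERCENTILE.EXC; "median_unbiased" is definition 8, the estimator Hyndman
# and Fan recommended; "linear" (definition 7) is used by plain PERCENTILE.
for method in ("linear", "weibull", "median_unbiased"):
    lo = np.quantile(values, 0.025, method=method)
    hi = np.quantile(values, 0.975, method=method)
    print(f"{method:16s} {lo:7.1f} {hi:7.1f}")

On small samples the definitions can differ noticeably in the tails, which is exactly where the reference limits are located.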
Unfortunately, only limited movement towards standardization has taken place since 1996 (18). Most statistical programs stick to their traditional ways of calculating quantiles, probably to maintain backward compatibility with older versions.
CLSI has published an authoritative and practical standard, “Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory: Approved Guideline EP28-A3c” (11), and the subject as a whole is summarized by Harris and Boyd (7).
Several excellent software packages are available for the statistical analysis of reference intervals (5, 19-23), but they are not as commonly used as they deserve to be, e.g. due to cost constraints. Recently Geffré et al. published “Reference Value Advisor”, a set of comprehensive freeware macros for Microsoft Excel® that calculate reference intervals according to the procedures recommended e.g. by EP28-A3c (24). The macros, together with the original paper, can be downloaded at http://www.biostat.envt.fr/spip/spip.php?article63. The increased availability of low-cost computing machinery has made distribution-independent, computer-intensive resampling methods accessible in routine work (25).
The fundamental purpose of statistics in general, and of statistics in reference interval estimation in particular, is to generalize from a limited number of observations to the whole population of observations. All statistical methods are based on assumptions which fundamentally determine the conclusions that can be drawn from the data. Parametric statistical methods, which are the most widely used, assume that the population values are distributed according to a mathematically well-defined distribution, the Gaussian/Normal distribution. Parametric methods were initially described in the first half of the twentieth century, when the means for performing extensive calculations were manual or semi-automated at best. Using distribution functions to describe natural phenomena, including research data, substantially simplified the calculations required for statistical analysis. Without these known statistical distribution functions it would have been impossible to use advanced statistical methods in practice.
The wide availability of low-cost computing power in the 1980s substantially changed the options for working with the data as they are, irrespective of theoretical distributions. A seminal work on resampling methods by Jones in 1956 (26) was followed by the very influential works of Efron et al. (12, 25), which brought these methods into the mainstream of data analysis.
The purposes of the present paper are to provide detailed instructions for using distribution-independent, computer-intensive resampling/bootstrap methods with the standard functions of Microsoft Excel® 2010 or later to estimate reference intervals, and to put resampling methods into a theoretical and practical perspective.
Description of the resampling method
Microsoft Excel® 2013 for Windows was used on a personal computer running the Windows 7 operating system with an Intel® Core® i7-3520M 2.90 GHz CPU and 16 GB RAM. The 83 reference results were obtained from a data set of creatinine reference values published by Geffré et al. (24).
A simple method based on ranked results was initially used (27, 28). The results were sorted using the sorting function of Microsoft Excel® and the rank numbers of the reference limits were computed using the formulas given above: the lower limit has the rank number 0.025 x (n+1) and the upper limit the rank number 0.975 x (n+1), where n is the total number of reference samples.
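With n = 83 the rank numbers are 0.025 x 84 = 2.1 and 0.975 x 84 = 81.9, so the limits fall between ranked values and must be interpolated. A sketch of this interpolation in Python (the simulated data merely stand in for the 83 creatinine values, which are not reproduced here):

import numpy as np

rng = np.random.default_rng(3)
data = np.sort(rng.normal(100, 25, size=83))   # placeholder for the 83 values

def value_at_rank(sorted_data, r):
    # linear interpolation between the 1-based ranks bracketing rank r
    i = int(r)                                  # lower bracketing rank
    frac = r - i
    return sorted_data[i - 1] + frac * (sorted_data[i] - sorted_data[i - 1])

n = data.size
print(value_at_rank(data, 0.025 * (n + 1)),    # rank 2.1
      value_at_rank(data, 0.975 * (n + 1)))    # rank 81.9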
The first randomly drawn reference value
The first randomly drawn observation, placed in cell C1, was created with the following function: =INDEX($A$1:$A$10000,RANDBETWEEN(1,COUNT($A$1:$A$10000))) (Figure 2).
The INDEX function in Microsoft Excel® is used to locate a single data value in a table by specifying the row and column numbers corresponding to the data point to be retrieved. The format is =INDEX(range_to_search_in, row, column).
The $ signs in the cell references fix the data column and row, so the referenced range does not change when the formula is copied to other cells (Figure 2).
The COUNT function in Microsoft Excel® counts the cells in a selected range that contain numbers, ignoring e.g. empty cells. In this case COUNT($A$1:$A$10000) counts the number of observations entered in rows 1 to 10 000 of column A. It is used here to eliminate the need for the user to keep track of the number of reference values in the reference sample.
RANDBETWEEN(1,COUNT($A$1:$A$10000)) returns a random row number between 1 and the number of reference values entered in column A, in this case 83, so that INDEX retrieves a randomly chosen reference value. You can speed up the calculation by giving the number of data points in your particular data set directly, e.g. RANDBETWEEN(1,83), or the complete formula =INDEX($A$1:$A$83,RANDBETWEEN(1,83)). The increased speed results from the fact that Excel then does not need to repeatedly count the number of reference values you have entered.
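The same single draw can be expressed outside Excel; a sketch in Python, where values stands in for the reference values in column A:

import random

values = [86, 92, 105, 113, 127]        # stands in for the column A data
index = random.randrange(len(values))   # RANDBETWEEN(1, COUNT(...)) equivalent
draw = values[index]                    # INDEX(...) equivalent
print(draw)                             # or simply: random.choice(values)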
Subsequent randomly drawn reference values
Drag the small square (fill handle) in the lower right corner of cell C1 to cell AG83, thereby creating 30 samples, each of 83 random observations drawn from the original observations. By extending the columns further to the right you can increase the number of samples, e.g. to 1000 random samples.
Calculating the 0.025 and 0.975 percentiles
Calculate the percentiles, e.g. in rows 85 and 86, using the formulas =PERCENTILE.EXC(I1:I83;0.025) and =PERCENTILE.EXC(I1:I83;0.975), respectively (Figure 2). The PERCENTILE.EXC function, available only in Microsoft Office 2010 or later, is used in order to implement a recommended method for percentile value interpolation (17). The simple =PERCENTILE function of Microsoft Excel® will not work for the present purpose (see below). The PERCENTILE.EXC function returns the “#NUM!” error if the supplied value of k is < 1/(n+1) or > n/(n+1), where n is the number of reference sample values, or if the supplied array is empty. This means that random samples of at least 39 observations are needed to calculate the 2.5th and 97.5th percentiles.
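The minimum sample size follows directly from the k-bounds: 0.025 >= 1/(n+1) requires n >= 39. A sketch of the check in Python (NumPy's "weibull" method reproduces PERCENTILE.EXC's interpolation but not its #NUM! behaviour, so the bound is tested explicitly):

import numpy as np

for n in (38, 39):
    data = np.arange(1.0, n + 1)             # any n numeric values
    if 0.025 < 1 / (n + 1):                  # PERCENTILE.EXC's #NUM! condition
        print(n, "#NUM!")
    else:
        print(n, np.quantile(data, 0.025, method="weibull"))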
Calculate the medians of the random sample estimates of the population percentiles
Lastly, calculate the medians of the sample estimates of the population 0.025 and 0.975 percentiles using the median function, e.g. =MEDIAN(C85:AG85) and =MEDIAN(C86:AG86) (Figure 2).
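Putting the steps together, the whole worksheet procedure (resampling, per-sample percentiles, medians of the percentile estimates) can be mirrored in a few lines. A sketch assuming NumPy 1.22 or later; the simulated data again stand in for the 83 creatinine values:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(100, 25, size=83)          # placeholder for the 83 reference values

lowers, uppers = [], []
for _ in range(1000):                        # 1000 resamples, cf. columns C, D, ...
    resample = rng.choice(data, size=data.size, replace=True)
    # "weibull" reproduces PERCENTILE.EXC's k/(n+1) interpolation
    lowers.append(np.quantile(resample, 0.025, method="weibull"))
    uppers.append(np.quantile(resample, 0.975, method="weibull"))

print(np.median(lowers), np.median(uppers))  # the estimated reference limits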
Detailed instructions for creating the calculation sheet are presented in Table 1.
Table 1. Detailed instructions for creating the calculation sheet.
Thirty samples of 83 observations each resulted in the reference interval 53.5-165.8; 333 and 1000 samples of 83 observations each resulted in identical reference intervals of 53.9-147 (Figure 3). In comparison, the reference interval calculated by Geffré et al. was 53.4-146.9 (24).
The computer time needed for re-calculating 1000 samples of 83 observations each from a sample of 83 observations was less than 1 second.
Discussion
The resampling model built from standard functions of Microsoft Excel® 2013 and illustrated here resulted in a reference interval practically identical to that calculated with the Microsoft Visual Basic for Applications macros described by Geffré et al. (24).
Resampling methods are free from the assumption that the observations are distributed according to a particular theoretical distribution, but they do assume, importantly, that the distribution in the sample at hand is practically the same as that in the underlying population. This means that a sufficient number of observations is needed in the sample to make sure that it represents the population. Tu and Zhang found that applying resampling methods to random samples of 50 observations from a true Gaussian distribution gave confidence intervals covering the true value in only 88% of cases, even when the improved bias-corrected-and-accelerated bootstrap (29) was used. Good (13) recommends in the order of 100 observations in the sample, which is close to the 120 individuals/observations recommended by the IFCC (8, 30) and the CLSI (16). Geffré et al. (24) found that resampling methods are not appropriate when the number of reference samples is less than 40. Coscun et al. (31) calculated reference intervals for glucose, creatinine, blood urea nitrogen (BUN) and triglycerides from random samples of 20 to 120 individuals using resampling methods. Their results confirm that 40 reference samples represent a prudent minimum. At least 500, but preferably 1000 or more (31), resamples are generally recommended, which is supported by the results of the present study.
Bjerner et al. (32) have shown that parametric methods may be preferable to resampling methods, even when the number of reference samples is less than 120, when the distribution of observations in the reference samples is Gaussian or can be transformed to that distribution.
Microsoft Excel®
Microsoft Excel® is currently the most widely used and powerful spreadsheet program. Its widespread availability and versatility constitute its major advantages, enabling users to perform powerful descriptive, inferential and graphical statistical analyses, including methods not yet programmed in statistical packages. The program has, especially in its earlier versions, been criticized for its mathematical (33), statistical (34) and graphical properties. Several of these limitations have been overcome in the versions of Excel released since 2010 (35).
Is Microsoft Excel® appropriate for resampling estimations?
Amongst the advantages of using Microsoft Excel® for calculations in general, and for resampling in particular, are that the software is widely available, is already used for administrative purposes in the laboratory, and facilitates easy data handling and visualization. Using the native functions of Microsoft Excel® 2010 or later makes the resampling procedures highly visual, providing the user with hands-on contact with all the random samples taken and the calculations made. This is, however, also the disadvantage of using the native function gallery of Microsoft Excel® for this purpose, since there is a risk of errors when performing calculations on large tables of data, e.g. the 1000 columns needed for the present purpose.
Conclusion
Resampling methods are recommended by the IFCC and are widely available in software for handling reference values (5, 19-24), but they are not yet part of the CLSI guideline (16). Resampling methods are appropriate when the distribution of data from the reference samples is non-Gaussian and the number of reference samples is at least in the order of 40. At least 500-1000 random samples with replacement should be taken from the reference samples. Resampling methods are conveniently implemented in Microsoft Excel® 2010 or later, as shown here.