Statistical treatments in proficiency tests: Application of the median for detecting anomalous results

J. Laso Sánchez1, A. Peris García-Patrón1

1 Quality Services Office S.A.L., C/ Caridad 32, 28007 Madrid, e-mail: gscsal@gscsal.com

The aim of this work is to analyze different systems of statistical treatment for obtaining consensus values in proficiency testing schemes. In this article we discuss a new approach for the preliminary treatment of participants' results, as an alternative to other outlier rejection tests. This method exploits the effectiveness of the median as an elimination element, without modifying the participants' results. The test has been developed by GSC SAL and applied in multiple intercomparison schemes since 2001.

1.- Introduction

The evaluation of the quality of test results includes a wide range of activities, such as precision testing under different conditions, the use of reference materials, and intercomparison. Each of these provides different information about the characteristics of a method or the maintenance of its properties. Intercomparison is a particularly powerful tool that also serves other objectives: to control and verify our uncertainties and ensure that the declared values are true, or even to provide the data needed to perform a formal validation of our testing methods.

In the case of interlaboratory comparison tests, the data provided by participating laboratories often, as confirmed by experimental evidence, follow models similar to normal distributions, which frequently include observations that are "anomalous" with respect to the dataset. The preliminary treatment of these data involves minimizing or eliminating the statistical weight of these anomalous observations, with the objective of establishing appropriate parameters to evaluate the results of the participating laboratories.

2.- The need to apply outlier detection tests

The ultimate goal of applying outlier detection tests is to obtain reliable parameters for the intercomparison exercise, which are usually used to calculate the evaluation or classification parameter known as the z-score. They are also applicable to the estimation of other classification statistics such as En or z′.

3.- Outlier detection tests

Regarding the detection of statistically anomalous results, the main standards that establish systematic approaches for identifying and treating outliers are ISO 5725 and, in some cases, ISO 35. ISO 43 also provides some proposals on this subject.

The treatment of anomalous behavior in data has, until recently, had a dual objective: on the one hand, the detection of anomalous data in terms of precision (usually repeatability), compared with the precision of the measurements provided by the group of laboratories; and on the other hand, the detection of data (means or individual observations) from participants that deviate from the most probable value assigned by the laboratories. The detection of both "problems" (precision and accuracy) has driven the development of different tests, which are well known in the field of intercomparison. Some of them have been published in International Standards, such as those mentioned above. In general, these tests that identify statistically anomalous results are aimed, once such results are detected, at:
  1. correcting the data considered as such
  2. removing from the dataset those results that statistically differ from the rest.
Since anomalous behavior of results is generally attributable to human error, misapplication of methods, or misinterpretation of test requirements (for example, units of expression or final calculations), the correction of such results may not be the most advisable tactic, or is at least debatable. Once these data have been detected (and either removed or corrected), it is possible to assign to the exercise, for the analyte under test, its "parameters", VA and s.

In practice, discrimination tests based on precision, such as Cochran's test, have fallen into disuse. It has been demonstrated, or at least strongly suspected, that in proficiency testing elimination based on repeatability can be considered "unfair". Laboratories often interpret test conditions differently from those expected: although results are requested as more than one observation (generally duplicates), laboratories do not always report them, or the reported observations are identical or so similar that it raises the suspicion that the repetition was not performed from the initial sample but from a subsample within the process (for example, two determinations from the same extract). Moreover, since proficiency testing allows laboratories to apply methods with different known precisions, the elimination of certain data compared with others may be biased if the latter come from more widely used, more precise methods. A particularly illustrative case is the determination of density by hydrometric methods versus electronic densimetry. Likewise, the elimination of an anomalous result based on repeatability has no real repercussion for the participant, beyond exclusion from the dataset used to calculate the consensus parameters, since the z-score evaluation is ultimately applied to their mean result. These tests are gradually disappearing from proficiency testing, although in our opinion evaluating repeatability remains a useful tool that helps laboratories assess whether their mean was appropriate, and therefore understand the potential cause of a non-satisfactory z-score. GSC has also developed an informative system for evaluating the repeatability of a laboratory in relation to the group.

Continuing with outlier detection statistics, this time focusing on those results that deviate most from the value considered the most probable or expected, many tests have been employed, owing to their simplicity and even their inclusion with didactic examples in reference standards (such as Grubbs' test, published with application examples in ISO 5725). Among these, although also currently in decline due to their limited capacity to identify outliers when they are masked by grouping with other potentially anomalous results, we find for example:

Dixon's Q test, derived from a significance test that evaluates whether or not the Q statistic, as calculated, complies with the critical value established in tables:

Q = |xsuspect – xnearest| / (xmax – xmin)   (1)

Grubbs' test, a significance test that uses the G statistic as a calculation, comparing it with tabulated values:

G = |xsuspect – x̄| / s   (2)

Or its double Grubbs variant, which compares the sum of squared deviations obtained after removing the two most unfavorable results with that of the complete dataset:

G = S² (without the two most unfavorable results) / S² (all results)   (3)

Dixon's test has been used in some schemes until quite recently. The problem with applying the above-mentioned tests is that, in general, elimination is not entirely effective. The effectiveness of identifying outliers and their recurrent elimination has been shown to be limited in these cases, since their application assumes the following (a short code sketch of the Q and G statistics is given after the list):
  1. a sufficiently large data population
  2. that the data follow a distribution close to normal
  3. the absence of “multiple” or clustered outliers.
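By way of illustration, the following minimal Python sketch computes the Dixon Q and single Grubbs G statistics of equations (1) and (2). The function names and the example data are ours, and the comparison against the tabulated critical values (which depend on n and the chosen significance level) is left out.

```python
import statistics

def grubbs_g(values):
    """Single Grubbs statistic (2): largest absolute deviation from the mean, in units of s."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return max(abs(x - x_bar) for x in values) / s

def dixon_q(values):
    """Dixon Q statistic (1): gap between the suspect value and its nearest
    neighbour, divided by the total range; both extremes are checked."""
    data = sorted(values)
    spread = data[-1] - data[0]
    q_low = (data[1] - data[0]) / spread      # smallest value as suspect
    q_high = (data[-1] - data[-2]) / spread   # largest value as suspect
    return max(q_low, q_high)

# Hypothetical example; each statistic would then be compared with its critical value.
demo = [10.1, 10.3, 9.9, 10.2, 12.5]
print(dixon_q(demo), grubbs_g(demo))
```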
Therefore, and to overcome the impossibility of meeting the above conditions in many situations, there is a current trend towards applying outlier detection procedures based on robust statistics.

4.- Robust statistics in outlier detection

In this case, more modern criteria rely on robust statistics based on the properties of the median, which are not as affected by the type of population. The application of robust statistics seems to define the current landscape of outlier detection, in which the evaluator decides whether to eliminate or transform the outliers. ISO 13528, regarding the treatment of data provided by participants, is characterized by:
  • Establishing possible systematic approaches for assigning the central value, including the use of robust statistics, and the variability value, the latter generally based on the use of the target s.
  • Establishing, likewise, the advisability of comparing the actual s obtained with the target s, considering that the former should not exceed the latter by a critical factor of 1.2.
A fundamental characteristic of the standard is that it uses all the values obtained by participants, without discarding any, but modifying those considered outliers. The Algorithm A system, established in the aforementioned standard, is based on a process repeated recursively until the data obtained converge, and yields a central value, as a mean, together with a robust standard deviation.

If xi is the value of laboratory i among a total of p laboratories, the following are calculated:

x* = median (xi)
s* = 1.483 · median |xi – x*|
j = 1.5 · s*

The initial values xi are then replaced according to the following rule:

xi* = x* – j, if xi < x* – j
xi* = x* + j, if xi > x* + j
xi* = xi, in all other cases

In this way, anomalous data are replaced in the calculations by the corresponding extreme value. The new x* and s* are then calculated as:

x* = Σ xi* / p   (4)

s* = 1.134 · √[ Σ (xi* – x*)² / (p – 1) ]   (5)

The process is repeated until convergence.

5.- Robust statistics for outlier elimination

Understanding that robust statistics, and the use of the median as a fundamental element, represented the closest and safest horizon for outlier detection, Gabinete de Servicios para la Calidad has developed and applied its own systematic method since 2001. The assumptions underlying this methodology are:
    • The population of laboratories follows a Gaussian distribution.
    • There are laboratories with anomalous results that alter the distribution.
    • Anomalous values must be removed in order to calculate the true parameters of the population.
    • Anomalous values are located at the extremes.
    • Central values allow estimating the real data of the population while minimizing the influence of outliers.
The methodology is developed in the following steps (a code sketch of the complete procedure is given after the list):
    1. Obtain the median of the results provided by the participating laboratories, hereinafter Me.
Me = Median (xi)
    2. For each participant i, of the p total participants, calculate the absolute difference between the value obtained, VLi, and the median of the set, Me:
di = |VLi – Me|
    3. Obtain the median of these differences, Medi:
Medi = Median (di)

This value should correspond to the point where 50% of the population is located.
    4. Calculate the maximum admissible dispersion, smax, estimated as a function of the number of participating laboratories from a 50% probability interval, according to the equation:

smax = Medi / t(0.50, n–1)   (6)

where t is the two-tailed Student's t for α = 0.50 and n–1 degrees of freedom, n being the number of data used in the estimation of the median. Thus smax and, subsequently, the acceptance interval are made dependent on the number of participants n.
    5. smax allows establishing an acceptance interval around the median:

Me – 2·smax ≤ VLi ≤ Me + 2·smax   (7)

such that results outside it are considered statistically anomalous, i.e., outside the population with 95% statistical confidence, and are eliminated. Note: other intervals, such as 99%, can be used to improve convergence.

Exclusion is thus based on the difference found between each individual value and the resulting median, which is initially accepted as the best estimate of the central value. This test is applied recursively, if deemed necessary, so that, after eliminating the anomalies, the following are calculated with the remaining results:
    • Calculate the mean of the remaining results, which will be considered as the Consensus Value or Assigned Value VA, used in the estimation of the z-score.
    • Calculate s, as the standard deviation of the non-excluded data, which in general will be used in the evaluation of the z-score.
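The following Python sketch illustrates the procedure just described. It is a sketch rather than the software actually used in the scheme: the function names are ours, scipy is assumed to be available for the Student's t quantile, and equations (6) and (7) are applied as written above, with the acceptance interval taken as Me ± k·smax (k = 2 for the 95% case; wider values such as 3 may be used).

```python
from statistics import mean, median, stdev
from scipy.stats import t


def gsc_screen(values, k=2.0):
    """One pass of the median-based screening of Section 5.

    values : participant means (VLi)
    k      : half-width of the acceptance interval in units of smax.
    """
    n = len(values)
    me = median(values)                          # Me = Median(xi)
    d = [abs(v - me) for v in values]            # di = |VLi - Me|
    medi = median(d)                             # Medi = Median(di)
    t_50 = t.ppf(0.75, n - 1)                    # two-tailed Student's t for alpha = 0.50
    s_max = medi / t_50                          # equation (6)
    low, high = me - k * s_max, me + k * s_max   # equation (7)
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed


def gsc_consensus(values, k=2.0):
    """Apply the screening recursively, then return VA, s and N of the kept data."""
    kept = list(values)
    while len(kept) >= 3:
        kept, removed = gsc_screen(kept, k)
        if not removed:                          # stop when no further eliminations occur
            break
    return mean(kept), stdev(kept), len(kept)
```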
6.- Evaluation of the exercise parameters. Estimation of the z-score

The elimination system presented incorporates particular characteristics, among which are:
    • Estimation of intervals using the median statistic, which is less dependent on the normality of the population, an assumption that is especially problematic when the number of data points is small.
    • Adaptation of acceptance intervals as a function of the number of participants (Student’s t).
    • No manipulation or correction of laboratory data towards limit values not provided by the participants, which in many cases cannot be guaranteed. Recall that the data correction proposed in Algorithm A implies assigning limit values to participants with unacceptable results, creating extreme values on both sides of the population that affect the calculation of a fictitious robust standard deviation (a code sketch of Algorithm A is given after this list).
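For comparison with the elimination approach, the following Python sketch follows Algorithm A as summarized in Section 4: extreme values are not discarded but pulled back to x* ± j before recomputing the robust mean and standard deviation of equations (4) and (5). Again, this is an illustrative sketch, not a validated implementation of the standard.

```python
from statistics import median


def algorithm_a(values, max_iter=100, tol=1e-9):
    """Robust mean x* and robust standard deviation s* (Algorithm A, Section 4)."""
    p = len(values)
    x_star = median(values)
    s_star = 1.483 * median([abs(x - x_star) for x in values])
    for _ in range(max_iter):
        j = 1.5 * s_star
        # Replacement rule: clip each xi into [x* - j, x* + j]; no data are discarded.
        clipped = [min(max(x, x_star - j), x_star + j) for x in values]
        new_x = sum(clipped) / p                                                    # equation (4)
        new_s = 1.134 * (sum((x - new_x) ** 2 for x in clipped) / (p - 1)) ** 0.5   # equation (5)
        if abs(new_x - x_star) < tol and abs(new_s - s_star) < tol:                 # convergence
            return new_x, new_s
        x_star, s_star = new_x, new_s
    return x_star, s_star
```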
Thus, it is possible to determine exercise parameters with the following characteristics:
    • Assigned value, calculated whenever possible, through the consensus value of the laboratories. The assigned value will be the mean of the data not eliminated as statistical anomalies which, if the accepted data behave in a “quasi-normal” way, should statistically coincide with the median.
    • Establishment of the s value, which will denote the degree of confidence or reliability in the assigned value used in the z-score calculation.
    • This s value, calculated as the experimental standard deviation, may be used, subject to assessment, for z-score calculation, although other schemes are possible.
    • The evaluator must describe the origin of each parameter of the exercise so that the participant may, if considered appropriate, carry out their own evaluation.
7.- Treatment examples

First, we will compare the behavior of three of the tests mentioned in this article: the Grubbs test, Algorithm A, and the GSC robust median. In this case, the following results were obtained in an intercomparison of a heavy metal in food (results expressed in ppb):

Table 1.
Participant   Mean (ppb)   Participant   Mean (ppb)
1             180.05       11            198
2             350          12            199
3             322.9        13            224
4             126.5        14            222.95
5             244.99       15            234.65
6             225          16            288.25
7             220.8        17            210
9             205.2        18            222.15
10            181          19            241.05
The values written in italics in the original table represent those eliminated by the GSC test, as specified in the summary table below, which includes both the initial results and those obtained after applying the statistical treatment of the mentioned tests (a short usage example based on the sketches above is given after Table 2):

Table 2.
TEST RESULTS
Parameter   Initial data   Simple/Double Grubbs   Algorithm A   GSC system
Mean        228            228                    217           215
Median      223            223                    223           222
s           51.7           51.7                   20.4          20.3
N           18             18                     18            14
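As a usage illustration, assuming the functions from the two sketches above are available, the Table 1 data can be processed and the outputs compared with the corresponding columns of Table 2:

```python
# Heavy-metal results from Table 1 (ppb); participant 8 does not appear in the table.
table1 = [180.05, 350, 322.9, 126.5, 244.99, 225, 220.8, 205.2, 181,
          198, 199, 224, 222.95, 234.65, 288.25, 210, 222.15, 241.05]

va, s, n_kept = gsc_consensus(table1)       # GSC sketch (Section 5)
x_star, s_star = algorithm_a(table1)        # Algorithm A sketch (Section 4)

print(f"GSC sketch:   VA = {va:.1f}, s = {s:.1f}, N = {n_kept}")
print(f"Algorithm A:  x* = {x_star:.1f}, s* = {s_star:.1f}")
```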
The issue with many elimination tests is confirmed in this example: the impossibility of eliminating clearly anomalous results (the simple/double Grubbs treatment leaves the initial data unchanged). The results from Algorithm A and GSC are equivalent, after applying a single iteration (further iterations do not produce any elimination).

On the other hand, to assess the adequacy of the proposed systematic approach, the data presented in the examples of Annex III of the IUPAC protocol were processed with the GSC test, in order to compare the results obtained with those of the system proposed there.

7.1. IUPAC Example 1: Unimodal and symmetrical distribution (% mass property)

The results of the treatment carried out by the GSC system were as follows:

Table 3.
G.S.C. SYSTEM DATA
Parameter                      Initial   Iteration 1
Mean                           53.103    53.307
Median                         53.297    53.31
s                              1.962     0.5036
N                              68        60
Median of differences (Medi)   0.3805    0.32
Theoretical s (smax)           0.561     0.471
A 3s interval was used to define elimination limits, due to the symmetry of the distribution.
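As a brief consistency check of equation (6) as written above: for the initial column of Table 3, smax = Medi / t(0.50, 67) = 0.3805 / 0.679 ≈ 0.561, which matches the theoretical s reported; likewise, for iteration 1 (n = 60), smax = 0.32 / 0.679 ≈ 0.471.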