The borderline range of toxicological methods: Quantification and implications for evaluating precision

Testing methods for assessing the skin sensitisation potential of a substance usually apply threshold criteria to dichotomise continuous experimental read-outs into yes/no conclusions. The threshold criteria are prescribed in the respective OECD test guidelines, and the conclusion is used for regulatory hazard assessment, i.e. classification and labelling of the substance. A borderline range (BR) can be identified around the classification threshold, within which test results are non-conclusive because of a testing method's biological and technical variability. We quantify BRs in the prediction models of the non-animal testing methods DPRA, LuSens and h-CLAT, and of the animal test LLNA. Depending on the size of the BR, between 6% and 28% of the substances in the sets tested with these methods were considered borderline. If the results of individual non-animal testing methods are combined into integrated testing strategies (ITS), borderline results of individual tests can also affect the strategy's overall assessment of skin sensitisation potential. This was analysed for the '2-out-of-3' ITS: four out of 40 substances (10%) were considered borderline. Based on our findings we propose expanding the standard binary classification of substances into 'positive'/'negative' or 'hazardous'/'non-hazardous' by adding a 'borderline' or 'non-conclusive' alert for cases where test results fall within the borderline range.


Introduction
Skin sensitisers are substances that can lead to an allergic response following skin contact (UNECE, 2011). An individual may be sensitised upon first contact. Subsequent contact can then provoke allergic contact dermatitis (ACD). It is estimated that ACD affects about 20% of the European and North American population at least once in their lifetime, although the prevalence of skin sensitisation varies considerably between age and sex groups (Thyssen et al., 2007). Data on skin sensitisation potential have to be provided for all substances produced or manufactured above one tonne per year under the European chemicals legislation REACH, and for classification and labelling of substances under the European CLP regulation (ECHA 2016). The assessment of a substance's skin sensitisation potential has traditionally been based on data derived from animal tests such as the guinea pig based tests described in OECD TG no. 406 (OECD, 1992) or the murine local lymph node assay (LLNA) described in OECD TG no. 429 (OECD, 2002 and 2010). However, animal welfare concerns and regulatory enforcement, e.g. by the Cosmetics Regulation (European Commission, 2009) and the REACH legislation (European Commission, 2006), have driven efforts to move away from animal to non-animal testing. A number of non-animal testing methods have been developed (Mehling et al., 2012; Reisinger et al., 2015), two of which, namely the direct peptide reactivity assay (DPRA) (Gerberick et al., 2004; Gerberick et al., 2007) and the antioxidant response element-nuclear factor erythroid 2 (ARE-Nrf2) luciferase testing methods covered by KeratinoSens™ (Natsch et al., 2011), have been validated by the European Centre for Validation of Alternative Methods (ECVAM; Italy) and are described in the OECD TG no. 442C and no. 442D (OECD, 2015a, b). LuSens (Ramirez et al., 2014; Ramirez et al., 2016) also covers the ARE-Nrf2 luciferase testing method and is currently undergoing validation. Another non-animal testing method, the human cell line activation test (h-CLAT) (Ashikaga et al., 2010; Ashikaga et al., 2006; Sakaguchi et al., 2006; Sakaguchi et al., 2010), has recently been validated by ECVAM and is described in OECD TG no. 442E (OECD, 2016). The sequential structure of molecular and cellular mechanisms causing ACD is represented by the "adverse outcome pathway" (AOP) for skin sensitisation, consisting of eleven causally linked steps, four of which were defined to be essential and specific ("key events") (OECD, 2012a, b). The DPRA, the ARE-Nrf2 testing methods and the h-CLAT cover the first three key events of the skin sensitisation AOP.
For hazard classification purposes, i.e. for assessing skin sensitisation potential, continuous data obtained from animal tests or from non-animal testing methods are dichotomised into binary 'positive'/'negative' information (Van der Schouw et al., 1995; Hoffmann and Hartung, 2005). The prediction models used for the DPRA, LuSens and the h-CLAT are described in OECD TG no. 442C (OECD, 2015a), Ramirez et al. (2014 and 2016), and OECD TG no. 442E (OECD, 2016), respectively. Based on the threshold for classification, a testing method's accuracy, i.e. the percentage of true positive and true negative classifications, can be determined (see, for example, Yerushalmy, 1947; Cooper et al., 1979). The experimental data obtained from a testing method are, however, subject to biological and technical variability. Consequently, repeated testing may result in discordant classification results. This affects the precision of a testing method, defined as the ability of a testing method to deliver concordant results in repeated applications. The problem of intra- and inter-assay variability of in vitro methods has been observed earlier; see Hothorn (2002 and 2003). Luechtefeld et al. (2016) pointed to a limited intra-assay reproducibility of skin sensitisation potential and potency data.
This paper focuses on the intra-assay variability of testing methods for skin sensitisation potential assessment. Specifically, we analyse limitations with regard to the reproducibility of results when continuous dose-response data are transformed into 'toxic'/'non-toxic' outcomes. Kolle et al. (2013), Hoffmann (2015), Dumont et al. (2016), and Dimitrov et al. (2016) analysed the intra-assay variability of the LLNA. Kolle et al. (2013) showed that for those substances for which the estimated concentration (EC3) leads to a stimulation index (SI) value relatively close to the threshold for classification (i.e. SI = 3; Kolle et al., 2013), repeated testing resulted in both positive and negative classifications of their skin sensitisation potential. Kolle et al. (2013) defined a range around the classification threshold of the LLNA, within which discordant outcomes can be expected, by determining coefficients of variation based on individual animal data. This range is called "borderline range" (BR) (Kolle et al., 2013) or "grey zone" (Dimitrov et al., 2016). The percentage of substances falling into the BR of a testing method's prediction model reflects a testing method's limited precision.
Analyses of the BR for non-animal testing methods used for skin sensitisation potential assessment have not been conducted so far. The aim of this paper is, therefore, to examine the impact of technical and biological variability on the precision of selected non-animal testing methods for skin sensitisation potential assessment. Moreover, we examine how the precision of the non-animal testing methods and that of the animal test LLNA is affected by variations of the BR. For this purpose we suggest an approach to quantify BRs for the non-animal testing methods DPRA, LuSens, h-CLAT and the LLNA, based on results obtained from a large number of experiments. The approach to quantify the BR, and the decision rules for applying the BR to the prediction models of individual testing methods, are described in Section 2. Results from quantifying the BR for each individual testing method are presented in Section 3.1. Borderline substances (i.e. substances which produced test results in the BR) detected in the experimental sets of individual testing methods are shown in Section 3.2. In addition, we suggest a decision rule for applying the BR to a combination of the DPRA, LuSens and the h-CLAT into the '2-out-of-3' ITS. Section 3.3 shows borderline substances for the '2-out-of-3' ITS. Section 4 discusses implications from considering the BR in non-animal testing methods' prediction models, the LLNA, and the '2-out-of-3' ITS, respectively. Section 5 concludes.

2 Materials and methods

Testing methods
The three non-animal testing methods DPRA, LuSens, and h-CLAT were developed to address the three key events of the AOP in order to assess a substance's skin sensitisation potential. We compared our findings to those of the LLNA as the in vivo reference test in order to evaluate the precision of these methods. The samples used for quantifying the BR contained 42 substances in case of the DPRA, 26 substances in case of LuSens, 13 substances in case of the h-CLAT, and 22 substances in case of the LLNA. The BR was quantified using results from a large number of experimental runs of each testing method. Information about the samples used for determining the BR for each non-animal testing method and the LLNA, the number of experimental runs conducted and the substance concentrations used in the experiments is provided in Appendix 1, Tables 1-4. Where substance names could not be provided due to data confidentiality, substances were numbered consecutively.
The experimental sets to which the BR concept was applied in order to identify borderline substances consist of 199 substances in case of the DPRA, 79 in case of LuSens, 40 in case of the h-CLAT, and 22 substances in case of the LLNA; see Bauch et al. (2012) and Urbisch et al. (2015, 2016). The composition of these sets is presented in Appendix 3, Tables 1-4.

The Local Lymph Node Assay
The Local Lymph Node Assay (LLNA) has been the 'first choice' animal test for the assessment of skin sensitisation potential (Kimber et al., 1994). It is described in OECD TG 429, which was first published in 2002 and updated in 2010 (OECD, 2002 and 2010). In the LLNA, the proliferation of lymphocytes in auricular draining lymph nodes induced by substances is quantified by comparing the mean proliferation in each test group to the mean proliferation in the vehicle-treated control group. The ratio of the mean proliferation in each treated group to that in the concurrent vehicle control group, termed the Stimulation Index (SI), is determined. The classification threshold T of the LLNA is SI = 3. If SI > 3, a substance is classified as a skin sensitiser.
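The SI calculation and the threshold rule described above can be illustrated with a minimal Python sketch. The function names and the proliferation values are hypothetical and serve only as an illustration; the strict inequality mirrors the rule as worded in this section.

```python
from statistics import mean

LLNA_THRESHOLD = 3.0  # classification threshold T (SI = 3), as stated in the text


def stimulation_index(treated, vehicle_control):
    """Ratio of mean proliferation in a treated group to that in the vehicle control group."""
    return mean(treated) / mean(vehicle_control)


def llna_call(si, threshold=LLNA_THRESHOLD):
    """Binary call following the rule as worded in the text (SI > 3 means sensitiser)."""
    return "positive" if si > threshold else "negative"


# Hypothetical proliferation read-outs (arbitrary units), for illustration only.
treated_group = [1300.0, 1500.0, 1700.0]
control_group = [400.0, 450.0, 500.0]

si = stimulation_index(treated_group, control_group)
print(f"SI = {si:.2f}: {llna_call(si)}")
```

A binary call of this kind carries no information on how close the SI was to the threshold, which is the gap the borderline range is meant to address.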

The Direct Peptide Reactivity Assay
The Direct Peptide Reactivity Assay (DPRA) was developed by Gerberick et al. (2004, 2007). The DPRA has been formally validated and the OECD Testing Guideline TG 442C (OECD, 2015a) was adopted in 2015. In the DPRA, depletions of two model peptides containing a cysteine or lysine residue as a reactive nucleophilic centre are measured after incubation with a test substance. The classification threshold T of the DPRA is the mean depletion of 6.38% of the two peptides compared to the depletion in the reference controls (OECD, 2015a). If the mean lysine and cysteine peptide depletion is above this threshold, a test substance is considered to be peptide reactive. According to OECD TG 442C the DPRA can be used, together with complementary information, to discriminate sensitisers and non-sensitisers. Depending on the regulatory framework, a positive result of the DPRA can serve as stand-alone information for classifying substances into Category 1 for skin sensitisation. However, as emphasised in the ECHA Guidance on information requirements and Chemical Safety Assessment Chapter R.4a (ECHA 2016), the DPRA should not be used in isolation for identifying a skin sensitiser or non-sensitiser.

The ARE-Nrf2 luciferase method
The ARE-Nrf2 luciferase method utilises the gene induction regulated by the antioxidant response element (ARE) in transgenic human keratinocyte cell lines. The OECD Test Guideline TG 442D (OECD, 2015b) was adopted in 2015. The ARE-Nrf2 luciferase method is covered by KeratinoSens™ (Natsch et al., 2011) and LuSens (Ramirez et al., 2014). In this study, the LuSens assay is used. In ARE-Nrf2 luciferase methods the keratinocyte activating potential is determined by measuring luciferase induction after treatment with a test substance relative to concurrent vehicle controls. A statistically significant fold induction (FI) of the luciferase activity above 1.50 is considered to indicate a keratinocyte activating potential of a test substance. The classification threshold T for LuSens is FI = 1.50, above which a substance is considered to have a keratinocyte activating potential. Similar to the DPRA, LuSens is not considered suitable for classifying substances as skin sensitisers or non-sensitisers if used in isolation (ECHA 2016).

The human Cell Line Activation Test
The human Cell Line Activation Test (h-CLAT) (Ashikaga et al., 2006; Ashikaga et al., 2010; Sakaguchi et al., 2006; Sakaguchi et al., 2010) determines the dendritic cell (DC) activating potential by measuring the induction of the expression of the cell surface markers CD54 and CD86 after treatment with a test substance relative to concurrent vehicle controls. The test uses immortalised human monocytic leukemia THP-1 cells as a surrogate for DCs. As indicated in the OECD testing guideline TG 442E (OECD, 2016), a two-fold induction of the CD54 expression and/or a 1.50-fold induction of CD86 expression at relative cell viabilities of at least 50% is considered to indicate a dendritic cell activating potential of a test substance. The classification thresholds T for the h-CLAT are FI = 2.00 for CD54 and FI = 1.50 for CD86. As for the DPRA and LuSens, the method only addresses a specific key event of the skin sensitisation AOP. Consequently, it should not be used in isolation for classifying skin sensitisation potential (ECHA 2016).
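As a rough illustration of the two-marker rule described above, the following Python sketch evaluates one h-CLAT run. The function name, the tuple layout and the use of "greater than or equal" at the thresholds are assumptions made for this example; they are not taken from the test guideline.

```python
CD54_THRESHOLD = 2.0   # two-fold induction of CD54 expression
CD86_THRESHOLD = 1.5   # 1.50-fold induction of CD86 expression
MIN_VIABILITY = 50.0   # minimum relative cell viability in percent


def hclat_run_call(readouts):
    """Return 'positive' if any sufficiently viable concentration reaches a marker threshold.

    `readouts` is a list of (cd54_fold, cd86_fold, viability_percent) tuples,
    one per tested concentration. Using >= at the thresholds is an assumption.
    """
    for cd54, cd86, viability in readouts:
        if viability < MIN_VIABILITY:
            continue  # concentration too cytotoxic to be evaluated
        if cd54 >= CD54_THRESHOLD or cd86 >= CD86_THRESHOLD:
            return "positive"
    return "negative"


# Hypothetical read-outs for one run: (CD54 fold induction, CD86 fold induction, viability %).
run = [(1.2, 1.1, 95.0), (2.4, 1.3, 80.0), (3.1, 2.2, 35.0)]
print(hclat_run_call(run))
```

In this sketch a single concentration is enough for a positive call, which is one reason why variability near either marker threshold matters for this method.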

The '2-out-of-3' ITS for characterising skin sensitisation potential
The '2-out-of-3' ITS (Bauch et al., 2012; Urbisch et al., 2015; OECD 2016a and b; see also Sauer et al., 2016) is an integrated testing strategy for the assessment of skin sensitisation potential. According to this approach, 2 out of 3 concordant test results using the DPRA, the ARE-Nrf2 luciferase method, and the h-CLAT determine the prediction. The ARE-Nrf2 luciferase method can be covered by LuSens or KeratinoSens™. The '2-out-of-3' ITS addresses the first three consecutive key events of the AOP for skin sensitisation and is a selected case study for integrated approaches to testing and assessment (IATA) (Urbisch D., 2015). Applying the BR concept to the '2-out-of-3' ITS provides a measure for evaluating the performance of this specific IATA case.

2.2 Approach to quantifying the borderline range (BR)
The first step for assessing a testing method's limited precision was to develop an approach for the quantification of the BR. The BR denotes the area around the classification threshold for which a testing method's prediction model may deliver discordant results in repeated applications. For each non-animal testing method considered, and for the animal test LLNA, we derived the BR from the pooled standard deviation of a testing method's results, $SD_{pooled}$ (Eq. 1), pooled across substances $i$ and concentrations $j$ (i.e. the dose in case of the LLNA). The notation used is presented in Table 1. We use the pooled standard deviation to define the BR around a prediction model's classification threshold $T$:

$$BR = \{\, T - SD_{pooled},\; T + SD_{pooled} \,\}. \qquad (1)$$

Thus, it is assumed that the BR is symmetric around the classification threshold. For a given testing method, the pooled standard deviation of experimental results retrieved from testing different substances and concentrations is calculated as follows:

$$SD_{pooled} = \sqrt{\frac{\sum_{i}\sum_{j} (n_{ij} - 1)\, s_{ij}^{2}}{\sum_{i}\sum_{j} (n_{ij} - 1)}}, \qquad (2)$$

where $s_{ij}^{2}$ is the variance of results for substance $i$ and concentration $j$, and $n_{ij}$ is the corresponding number of replicates. The standard deviation per substance $i$ and concentration $j$ is given by

$$s_{ij} = \sqrt{\frac{1}{n_{ij} - 1} \sum_{k=1}^{n_{ij}} \left( x_{ijk} - \bar{x}_{ij} \right)^{2}}, \qquad (3)$$

which acknowledges that for a certain concentration different replicates $k = 1, \dots, n_{ij}$ with results $x_{ijk}$ and mean $\bar{x}_{ij}$ can be generated. A numerical example illustrating the approach described above is presented in Appendix 2.
For determining the BR according to Eq. (1) we consider the part of the distribution of test results that is close to the classification threshold $T$ to be most relevant. Therefore, we use values from pre-defined ranges of test results around the threshold. As it cannot be determined ex ante how broad or narrow a range should be, we conducted a sensitivity analysis, considering different ranges. In this way, we gain insight into the relationship between the size of the BR and the number of borderline substances. The ranges are shown in Table 2. Note that the BR approach suggested in this paper goes beyond Kolle et al. (2013), who calculated the BR only for the LLNA based on individual animal data.
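The pooling step can be sketched in Python as follows. This is a minimal sketch under two assumptions that are not confirmed by the text at this point: that the pooled standard deviation is the usual degrees-of-freedom-weighted combination of the per-substance, per-concentration sample variances, and that the BR extends one pooled standard deviation on either side of the threshold. The function names and the replicate values are hypothetical.

```python
import math
from statistics import variance


def pooled_sd(cells):
    """Pooled standard deviation across (substance, concentration) cells.

    `cells` maps a (substance, concentration) key to its list of replicate results.
    Each cell variance is weighted by its degrees of freedom (n - 1); cells with
    fewer than two replicates carry no variance information and are skipped.
    """
    weighted_sum = 0.0
    dof_total = 0
    for replicates in cells.values():
        n = len(replicates)
        if n < 2:
            continue
        weighted_sum += (n - 1) * variance(replicates)  # sample variance of the cell
        dof_total += n - 1
    if dof_total == 0:
        raise ValueError("need at least one cell with two or more replicates")
    return math.sqrt(weighted_sum / dof_total)


def borderline_range(threshold, sd):
    """Symmetric borderline range around the classification threshold (assumed width: one SD)."""
    return threshold - sd, threshold + sd


# Hypothetical replicate results, not taken from the study data.
example = {
    ("substance A", 10.0): [5.0, 7.0],
    ("substance B", 10.0): [6.0, 6.0, 9.0],
}
sd = pooled_sd(example)
low, high = borderline_range(6.38, sd)
print(f"pooled SD = {sd:.3f}, BR = [{low:.2f}, {high:.2f}]")
```

Restricting `cells` to results that lie within a chosen range around the threshold before calling `pooled_sd` would correspond to the sensitivity analysis over data ranges; how exactly results were selected for each range is not specified here, so that step is left out of the sketch.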
In case of the DPRA the BR was quantified using results from repeatedly testing n = 42 substances with different substance concentrations. The tests were conducted in a GLP-certified laboratory of BASF SE, yielding 446 individual results including the positive control (see Appendix 1, Table 1). The cysteine and lysine depletion per substance and concentration were randomly paired. For each pair, we determined the mean peptide depletion rate (MPD) per substance and concentration. The ranges of test results considered for calculating the pooled standard deviation are presented in Table 2. For each pooled standard deviation, the corresponding BR was determined according to Eq. (1).
The BR in the prediction model of LuSens was calculated using test results from repeatedly testing n = 26 substances, including the positive and negative control, yielding 2206 individual results covering different concentrations per substance (see Appendix 1, Table 2). Again, experiments were conducted in a GLP-certified laboratory of BASF SE (using the Multimode Reader TriStar2 luminometer, Berthold Technologies, Germany), using a luciferase fold induction value of 1.50 as classification threshold T. Based on the available dataset the pooled standard deviation was calculated for defined ranges of test results (Table 2). Furthermore, only test substance concentrations with at least 70% relative viability were included in the analysis. The BR corresponding to each pooled standard deviation was determined according to Eq. (1).
The BR around the classification threshold of the h-CLAT was calculated using test results from n = 13 substances tested during routine (in-house) test applications, yielding 528 individual measurements covering different substances and concentrations (see Appendix 1, Table 3). The pooled standard deviation was quantified for defined ranges of fold inductions (FI) of CD54 and CD86 expressions documented in Table 2, and for substance concentrations with at least 50% relative viability. As for the DPRA and for LuSens, the BR corresponding to each pooled standard deviation was calculated according to Eq. (1).
Finally, the BR of the LLNA was quantified using test results obtained from the n = 22 performance standard (PS) substances (ICCVAM, 2009) that were repeatedly tested according to GLP, yielding 479 test results for substances tested at different concentrations (see Appendix 1, Table 4). As for the non-animal testing methods, the pooled standard deviation of LLNA test results was determined for different data ranges (Table 2).

Decision rules for identifying borderline substances in experimental sets tested with individual non-animal methods
The BR, determined according to the approach described in Section 2.2, can be applied to experimental sets of substances tested with non-animal methods. The aim is to detect those substances for which results fall within the BR and, hence, for which a clear-cut classification is not possible with sufficient confidence.
The application of the BR differed depending on the prediction model of the individual non-animal method considered. In case of the DPRA, substances were defined as borderline if the mean depletion rate was within the BR (see also Table 6 in Section 3.1). As described in Ramirez et al. (2014), the prediction model of LuSens requires that two consecutive concentrations per run reveal results above the classification threshold in order to assess the test substance as positive. For LuSens, therefore, we first established decision rules for determining the result across all concentrations considered in a run.
As illustrated in Table 3, for a given BR around the classification threshold of the LuSens prediction model, the outcome of a run was concluded to be positive if all results were above the upper margin of the BR. If the first concentration (denoted x in Table 3) gave a negative result and the consecutive concentration (x+1) was tested either borderline or negative, it was concluded that the overall test outcome of this run is negative. If LuSens revealed a borderline result for a certain concentration x and the follow-up concentration (x+1) was tested borderline or positive, the substance was considered borderline. In case of the h-CLAT, at least one of the test results of either the CD54 expression or the CD86 expression from at least one of the runs in an experiment has to fall into the BR for qualifying an experimental result as borderline. Hence, the conclusion on the overall result of the experiment (positive, negative) is based on results from just one concentration.
Second, we established a decision rule that allows concluding on the overall test result across runs. This was necessary because the testing protocols for LuSens and the h-CLAT require conducting two or more runs in order to classify a substance according to the results. In case of LuSens a complete experiment consists of two independent runs (each of which covers different concentrations; see Ramirez et al., 2014). If the results from two runs are discordant, a third run has to be conducted and the conclusion on a substance's skin sensitisation potential is based on the majority outcome. Similarly, a test substance is tested in the h-CLAT in two independent runs. If the results are discordant, another run has to be performed (OECD 2016). Acknowledging that dichotomised test results can be either positive, negative, or borderline, adopting a final conclusion on a substance's sensitisation potential may take up to four runs. The corresponding decision rules are shown in Table 4.
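The single read-out rule and the within-run LuSens rule described in this section can be sketched in Python as follows. This is an illustrative sketch only: the function and variable names are hypothetical, treating values exactly on a BR margin as borderline is an assumption, and the lookup table contains only those pairs of consecutive concentrations that the text spells out. Reading "all results above the upper margin" as two consecutive positive calls is likewise an assumption. Combinations the text does not mention return None rather than a guessed outcome.

```python
def classify(value, threshold, half_width):
    """Three-way call for a single read-out relative to a symmetric BR."""
    if value > threshold + half_width:
        return "positive"
    if value < threshold - half_width:
        return "negative"
    return "borderline"  # inside the BR, including its margins (assumption)


# Calls for two consecutive concentrations (x, x+1) that the text describes explicitly.
LUSENS_PAIR_RULES = {
    ("positive", "positive"): "positive",
    ("negative", "negative"): "negative",
    ("negative", "borderline"): "negative",
    ("borderline", "borderline"): "borderline",
    ("borderline", "positive"): "borderline",
}


def lusens_pair_call(call_x, call_x_plus_1):
    """Outcome for two consecutive concentrations; None where the text gives no rule."""
    return LUSENS_PAIR_RULES.get((call_x, call_x_plus_1))


# DPRA-style example: threshold 6.38% with a hypothetical half-width of 3.49 percentage points.
print(classify(5.0, 6.38, 3.49))

# LuSens-style example: threshold 1.50 with a hypothetical half-width of 0.25.
calls = [classify(fi, 1.50, 0.25) for fi in (1.10, 1.60)]
print(calls, lusens_pair_call(*calls))
```

The complete within-run and across-run rules are those of Tables 3 and 4; the lookup table above should be extended from those tables before any practical use.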

ALTEX Online first published February 23, 2017 https://doi.org/10.14573/altex.16062716

2.4 Decision rules for identifying borderline substances tested with the '2-out-of-3' ITS
Considering the BR of the prediction models of individual non-animal testing methods changes the possible outcomes of each method to negative, positive, or borderline/ambiguous. Since test results of borderline substances can (by definition) not unambiguously be denoted positive or negative, the respective substances cannot be compared with results from a reference animal test in order to conclude whether the test result is FP (i.e. erroneously classified as positive) or FN (i.e. erroneously classified as negative). The skin sensitisation potential, however, is assessed by a combination of the results of non-animal testing methods addressing different steps of the adverse outcome pathway (Jaworska, 2016; Kleinstreuer et al., 2016; Strickland et al., 2016). One of the simplest, yet successful, ways to do this is the '2-out-of-3' ITS (Bauch et al., 2012; Urbisch et al., 2015). The '2-out-of-3' ITS uses dichotomised results of individual non-animal testing methods (i.e. positive or negative). If a borderline/ambiguous outcome of an individual testing method is considered in the '2-out-of-3' ITS, its overall conclusion on the skin sensitisation potential of a test substance may also be borderline/ambiguous (or negative or positive). The '2-out-of-3' ITS assigns equal weights to each testing method. Hence, in a non-sequential setting the order of results of the individual methods does not matter. Consequently, one testing method yielding a borderline/ambiguous result will not change the overall result of the '2-out-of-3' ITS if the other two methods provided concordant negative or positive results. If the test results of two non-animal testing methods fell into the BR of their prediction models, the overall outcome was considered borderline. Likewise, the overall conclusion on the result of the '2-out-of-3' ITS was borderline if the three methods yielded positive, negative and borderline/ambiguous results, respectively. Table 5 lists the overall outcome of the '2-out-of-3' ITS depending on the results of the prediction models of the individual non-animal testing methods.
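The decision rule described above amounts to a majority vote with a third outcome, which can be sketched in Python as follows. The function name is hypothetical, and treating three borderline results as an overall borderline outcome is an assumption, since that case is not mentioned explicitly in the text.

```python
from collections import Counter

VALID_CALLS = {"positive", "negative", "borderline"}


def two_out_of_three(dpra, are_nrf2, hclat):
    """Overall '2-out-of-3' call from three individual calls.

    Two concordant positive or negative calls decide the outcome; every other
    combination (two or three borderline calls, or one call of each kind) is
    reported as borderline.
    """
    calls = [dpra, are_nrf2, hclat]
    for call in calls:
        if call not in VALID_CALLS:
            raise ValueError(f"unknown call: {call!r}")
    counts = Counter(calls)
    if counts["positive"] >= 2:
        return "positive"
    if counts["negative"] >= 2:
        return "negative"
    return "borderline"


print(two_out_of_three("positive", "borderline", "positive"))  # two concordant positives decide
print(two_out_of_three("positive", "negative", "borderline"))  # no majority
```

Because each method carries equal weight, the function depends only on the counts of each call and not on the order of its arguments.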

3.1 Quantification of the borderline range (BR) for the DPRA, LuSens, the h-CLAT and the LLNA
Table 6 shows for each testing method the ranges of test results used for calculating the pooled standard deviation, the corresponding pooled standard deviations and the retrieved BR values of the testing methods' prediction models. If a substance is tested with any of the testing methods shown in Table 2, and if the result falls within the BR of its prediction model, a clear-cut conclusion about the substance's response in this testing method is not possible with sufficient confidence. If, for instance, the BR = {2.89%, 9.87%} is selected for the DPRA prediction model and a substance reveals a mean peptide depletion within this range, the result can be concluded to be neither negative nor positive. Instead, such a test result would have to be qualified as 'borderline' because the result is likely to vary in repeated runs.
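A short Python check makes the symmetry of the DPRA example explicit. It decomposes the interval quoted above, and the interval of 1.35% to 11.41% quoted elsewhere in the text for the largest DPRA BR, into midpoint and half-width. The function name is hypothetical.

```python
def midpoint_and_half_width(lower, upper):
    """Decompose an interval into its centre and its half-width."""
    return (lower + upper) / 2.0, (upper - lower) / 2.0


# DPRA borderline ranges quoted in the text, in percent mean peptide depletion.
for lower, upper in [(2.89, 9.87), (1.35, 11.41)]:
    mid, half = midpoint_and_half_width(lower, upper)
    print(f"[{lower}, {upper}] -> midpoint {mid:.2f}, half-width {half:.2f}")
```

Both intervals are centred on the DPRA classification threshold of 6.38%, with half-widths of 3.49 and 5.03 percentage points respectively, which is consistent with a BR that is symmetric around the threshold.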

3.2 Identification of borderline substances in experimental sets tested with the non-animal testing methods DPRA, LuSens, and h-CLAT, and with the animal test LLNA
Substances for which test results fell within the BRs of the testing methods' prediction models are listed in Table 7. As expected, widening the BR of a given prediction model increased the number of borderline substances.
Depending on the size of the BR we found the number of borderline substances to be between 20 and 57 (of 199) in case of the DPRA. Of the 79 substances tested with LuSens, 4 and 5 were considered borderline. Regarding the h-CLAT the number of borderline substances varied between 8 and 10 (of 40), and in case of the LLNA we found the number of substances considered borderline to be 5 and 7 (of 22). A detailed list of all substances considered borderline under different BRs is presented in Appendix 3.
Table note: see Appendix 3, Tables 1-4, for a list of substances in the experimental sets to which the BR approach was applied. Source: own calculations.
Table 8 presents a list of substances considered borderline for each BR listed in Table 7. With regard to the largest BR considered for the DPRA (i.e. mean depletion between 1.35% and 11.41%), 11 of the 57 substances considered borderline tested positive in the LLNA. Regarding the smallest BR considered, 9 of the 20 borderline substances in the set tested with the DPRA revealed negative and 11 positive test results in the LLNA. Of these, one substance (salicylic acid) was also considered borderline in the LLNA (see Table 8). As illustrated in Figure 1, most substances considered borderline were non-sensitisers in the LLNA.
Source (Figure 1): own calculations based on results documented in Table 8.
In case of LuSens, doubling the data range for calculating the pooled standard deviation increased the BR only marginally. 4 of the 5 substances for which positive test results were within the BRs of LuSens revealed positive but non-borderline results in the LLNA. Of these, one was a non-sensitiser, 3 substances were weak, and one was a moderate sensitiser (see Table 8 and Appendix 3). Within the BRs of the h-CLAT all substances considered borderline were positive and also sensitisers in the LLNA. 3 substances were weak sensitisers. Of these, one substance (phenyl benzoate) was also in the BR of the prediction model of the LLNA. Two substances in the BRs of the h-CLAT were moderate, three strong and one an extreme sensitiser, respectively.
Table notes: (1) Substances considered borderline when applying the largest BR considered for each testing method (see also Table 6). (2) Prediction based on Urbisch et al. (2015); human data were extracted from Basketter et al. (2014).
Of the borderline substances in the experimental set of the LLNA, two (i.e. phenyl benzoate, methyl methacrylate) are weak sensitisers, and four (i.e. salicylic acid, methyl salicylate, chlorobenzene, nickel chloride) are non-sensitisers (Table 4). One substance (MCI/MI) was an extreme sensitiser. Most substances identified as borderline in the LLNA are also discussed in Kolle et al. (2013). For the smallest BR considered (i.e. BR = { }) our analysis revealed an equivalent percentage of borderline chemicals (23%) compared to Kolle et al. (2013). When the range around the classification threshold that is used for calculating the pooled standard deviation and, in turn, the BR is increased (i.e. BR = { }), phenyl benzoate is also a borderline substance, causing the percentage of substances falling in the BR of the LLNA to be slightly higher (27%) compared to Kolle et al. (2013) (23%). Note, however, that Kolle et al. (2013) determined the BR by calculating coefficients of variation based on individual animal data instead of pooled animal data.

3.3 Identifying borderline substances in the experimental set tested with the '2-out-of-3' ITS
As shown in Table 9 we found four of 40 (10%) of the substances tested with the '2-out-of-3' ITS to be borderline. This result was robust for all combinations of BRs considered in the prediction models of individual non-animal testing methods. A complete list of ITS results, including results for borderline substances, is included in Appendix 3. All four borderline substances were positive in the LLNA. Of these, one is a weak, one a moderate and two substances are strong sensitisers according to LLNA potency classes. One substance (phenyl benzoate) considered borderline in the '2-out-of-3' ITS was also borderline in the LLNA (Urbisch et al., 2015).

Identification of borderline substances and implications of the BR for assessing substances' skin sensitisation potential
The BR defines the area around a prediction model's classification threshold within which repeated testing is more likely to show discordant results. That is, within the BR a testing method is not precise, and conclusions about a borderline substance's skin sensitisation potential cannot be adopted with sufficient confidence. This limited precision is caused by technical and biological variability. If a substance yields test results falling into the BR, further testing with other available testing methods is required to allow for an unambiguous discrimination between a positive and a negative test outcome. The probability that a substance with unknown properties reveals a borderline result depends on the size of the BR and the distribution of test results. The latter, in turn, depends on the composition of the experimental set of substances. Conclusions regarding the probability of a substance revealing a borderline result are, therefore, only possible for a particular BR and assuming a representative set of substances.
In this study, we quantified the BR for prediction models of three non-animal testing methods, their combination in a '2-out-of-3' ITS, and the animal test method LLNA. The BR was derived from the pooled standard deviation around the individual testing methods' classification threshold. We considered different BRs in order to gain insight into the relationship between the size of the BR and the number of substances for which results fall into this range and for which, consequently, an unambiguous conclusion on their skin sensitisation potential cannot be adopted with sufficient confidence. Based on the BRs considered and the experimental sets used in our analysis, the percentage of substances considered borderline in the DPRA, LuSens and the h-CLAT was between 6% and 28% (Table 7). We find that 23% to 27% of the performance standard (PS) substances tested with the LLNA fall into its BR. This is slightly higher than results obtained from the variability assessment in Hoffmann (2015), which may be explained by considering that Hoffmann (2015) determined the BR from EC3 values, compared to SI values used in our analysis.
For the DPRA, the percentage of substances identified as borderline varied between 10% and 28%. LuSens has a stringent prediction model because two consecutive concentrations in each run, and two or more runs, must show concordant results in order to arrive at the final conclusion about a substance's skin sensitisation potential (Ramirez et al., 2014; Ramirez et al., 2016). Therefore, applying the BR approach to LuSens required two steps to identify borderline substances (i.e. identifying borderline substances within a run and across runs, see Tables 3 and 4). The stringent prediction model may be a reason why LuSens revealed a relatively small percentage of borderline substances (6% and 7%, depending on the size of the BR, see Table 7). The prediction model of the h-CLAT (Ashikaga et al., 2010; Sakaguchi et al., 2010) does not require concordant test results in consecutive concentrations of the same run to conclude on the substance's skin sensitisation potential. Furthermore, concordant test results for expression of the cell surface markers CD54 or CD86 from at least two runs within the same experiment are required to conclude on a positive or negative test result (OECD, 2016). Hence, compared to LuSens, the prediction model of the h-CLAT is less conservative. This may explain the higher percentage of borderline results in the substance set tested with the h-CLAT (20% and 25%, see Table 7). All borderline substances in the experimental set of the h-CLAT were sensitisers.

4.2 Precision of the '2-out-of-3' ITS
Following Jowsey et al. (2006), Basketter and Kimber (2009), Reisinger et al. (2015), and ECHA (2016), a single testing method cannot be used to predict skin sensitisation potential as a stand-alone method. The '2-out-of-3' ITS has been suggested as a suitable approach for the overall assessment of the skin sensitisation potential based on the results of three individual testing methods (Urbisch et al., 2015). Applying the BR concept to the '2-out-of-3' ITS (Urbisch et al., 2015) revealed four borderline substances in a set of 40 (10%), which is a lower share than for the LLNA (27%). The number of borderline substances identified remained constant for all BRs applied to the individual non-animal testing methods (see Table 5 in Appendix 3). Our results, therefore, may indicate that the precision of the '2-out-of-3' ITS is higher than that of the LLNA. Again, this result has to be treated with care because the experimental set of the LLNA differed from that of the non-animal testing methods used in the '2-out-of-3' ITS. Nevertheless, the majority rule applied by the '2-out-of-3' ITS reduces the influence of borderline substances on the overall conclusion about a substance's skin sensitisation potential in all cases where two of the three methods provide concordant results. This, in turn, increases the overall precision of the '2-out-of-3' ITS compared to the precision of the individual non-animal testing methods.
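The effect of the majority rule can be illustrated with a short sketch. The full decision table is not reproduced in this section, so the rule below is an assumption: two concordant positive (or negative) calls decide the overall result, and every other combination is treated as borderline. The function name and the labels 'P', 'N' and 'B' are illustrative.

```python
from collections import Counter

def two_out_of_three(results):
    """Combine three calls ('P', 'N' or 'B') by majority rule.

    Assumed rule: two concordant P (or N) calls decide the outcome;
    any other combination leaves the overall result borderline.
    """
    if len(results) != 3 or any(r not in ("P", "N", "B") for r in results):
        raise ValueError("expected three calls, each 'P', 'N' or 'B'")
    counts = Counter(results)
    if counts["P"] >= 2:
        return "P"
    if counts["N"] >= 2:
        return "N"
    return "B"

# One borderline call does not matter when the other two methods agree
overall_agree = two_out_of_three(["P", "B", "P"])
# One borderline call is decisive when the other two methods disagree
overall_split = two_out_of_three(["P", "N", "B"])
```

Under this assumed rule, a single borderline result only propagates to the overall assessment when the remaining two methods disagree or are themselves borderline, which is consistent with the smaller share of borderline substances observed for the ITS.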

4.3 Implications of the BR approach for evaluating a testing method's precision
The share of substances considered borderline in an experimental set depends on the size of the BR, which, in turn, depends on the precision of the experimental method, the specification of the classification threshold, and the data range around the threshold used for quantifying the BR. We find that the larger the BR, the higher the number and percentage of test results that fall within it, and vice versa.
The BR in a testing method's prediction model defines a range in which conclusions on substances' skin sensitisation potential cannot be drawn with sufficient confidence. Hence, for substances whose test results fall in the BR, the test result is inconclusive. Furthermore, our results illustrate that the broader the BR, the smaller the number of substances for which classifications can be made. This points to a trade-off between a testing method's precision (i.e. after removing results that fall in the BR) and the number of substances in the set for which a testing method is able to deliver decisive information. So far, normative criteria defining how broad the BR should be have not been established. The evidence provided in our study, therefore, does not allow for comparisons of precision (limitations) between individual testing methods. Further research must examine how the specification of a prediction model's classification threshold, in combination with the range and distribution of test results used for calculating the BR, affects the size of the BR.
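The trade-off can be sketched numerically: for a fixed set of results, widening the BR can only keep or increase the number of results inside it. The helper name, the threshold and the result values below are hypothetical and serve only to illustrate the relationship.

```python
def borderline_share(results, threshold, half_width):
    """Return (count, fraction) of results within threshold +/- half_width."""
    inside = sum(1 for x in results if abs(x - threshold) <= half_width)
    return inside, inside / len(results)

# Hypothetical test results for ten substances and a hypothetical threshold
results = [0.5, 1.8, 2.7, 2.9, 3.2, 3.6, 4.4, 7.0, 9.5, 12.0]
T = 3.0

# Count borderline results for increasingly broad BRs
widths = (0.25, 0.5, 1.0, 1.5)
counts = [borderline_share(results, T, w)[0] for w in widths]
# Substances that still receive a decisive positive or negative call
decisive = [len(results) - c for c in counts]
```

How quickly the count grows depends on how densely the results cluster around the threshold, which is why the share of borderline substances is specific to both the BR and the experimental set.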

Conclusions
Technical and biological variability of non-animal testing methods used for assessing skin sensitisation potential, and of the animal test LLNA, influence the precision of these methods. It is important to recognise that neither the animal test LLNA, often considered "the gold standard", nor non-animal testing methods perfectly predict effects in humans (due to limited accuracy), and that neither always yields clear-cut results (due to limited precision). A testing method's precision constraint caused by intra-assay variability can be captured by quantifying a BR around the classification threshold of the method's prediction model, which is used to transform continuous experimental data into a dichotomous result, being either 'positive' (indicating an effect) or 'negative' (indicating no effect). The size of the BR depends on the specification of the classification threshold in a testing method's prediction model, but also on the range of test results on both sides of the threshold that is considered appropriate for determining the BR. Test substances for which results fall within the BR of a testing method could be assessed as positive or negative upon re-testing; thus the result of the test is ambiguous. Any conclusion drawn from experimental data is constrained by uncertainties, and this is often neglected in reporting the results. The BR may offer a simple and pragmatic way to take into account that not every experimental result allows for a definite conclusion. We therefore suggest that a measure of precision, i.e. the percentage of substances falling in the BR, should be reported with every testing method. Furthermore, when using prediction models that dichotomise data there should always be three potential outcomes: positive, negative or borderline. The aim of our paper was to suggest an approach for determining the BR, and to illustrate for individual testing methods and an ITS how the BR can be applied. As our results illustrate, there is a trade-off between the size of the BR (i.e. the range for which a testing method is considered to be of limited precision) and the number of substances for which a test can assess skin sensitisation potential with sufficient confidence. Defining normative rules for determining the 'optimal BR' is beyond the scope of this paper and a matter for further research. This, essentially, requires weighing the (social) gains from increasing a testing method's precision against the (social) costs of making errors.
While the paper focused on skin sensitisation as a proof-of-concept case, the BR approach is a generic method and can be applied to other endpoints, tests, and ITSs. Further research should quantify the BR for a broader set of (non-animal) testing methods. Moreover, examining the precision of testing methods for continuous endpoints deserves further attention in order to provide complementary insights into testing methods' precision regarding potency assessment (Slob 2016).
Another important issue for further research and discussion is how to deal with borderline test results in a regulatory context. One possible option could be to define borderline outcomes as positive results by default. However, this would imply that the upper part of the BR is effectively ignored. Alternatively, one could require additional testing with other (non-animal) methods, which would argue for keeping redundant testing method options available. Decision-theoretic approaches such as the Bayesian Value-of-Information approach introduced in Leontaridou et al. (2016) can help to determine the optimal follow-up test in a systematic and transparent way. Finally, the question of how borderline substances affect testing methods' predictive performance deserves further attention. Since the overall conclusion on the hazardous potential of borderline substances remains inconclusive, they cannot contribute unambiguously to the evaluation of a testing method's accuracy. Ignoring the fact that a substance's test result is borderline will, therefore, cause over- or underestimation of, for example, a testing method's sensitivity or specificity. Exploring the size and direction of these errors for different non-animal testing methods, and analysing the influence of the size and composition of experimental sets on the number of borderline substances detected, will provide complementary insights into the implications of intra-assay variability for comparative evaluations of testing methods' predictive performance.
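The dependence of sensitivity and specificity on the treatment of borderline results can be illustrated with a small sketch. The data set below is entirely hypothetical, and the three treatments compared (counting borderline calls as positive, counting them as negative, or excluding them) are illustrative choices, not rules proposed in this paper.

```python
def sensitivity_specificity(pairs):
    """pairs: (reference_is_sensitiser, call) with call 'P' or 'N'."""
    tp = sum(1 for ref, call in pairs if ref and call == "P")
    fn = sum(1 for ref, call in pairs if ref and call == "N")
    tn = sum(1 for ref, call in pairs if not ref and call == "N")
    fp = sum(1 for ref, call in pairs if not ref and call == "P")
    return tp / (tp + fn), tn / (tn + fp)

def resolve(pairs, borderline_as):
    """Map 'B' calls to 'P' or 'N', or drop them when borderline_as is None."""
    out = []
    for ref, call in pairs:
        if call == "B":
            if borderline_as is None:
                continue
            call = borderline_as
        out.append((ref, call))
    return out

# Hypothetical set: six sensitisers and four non-sensitisers
data = [(True, "P"), (True, "P"), (True, "P"), (True, "B"), (True, "B"),
        (True, "N"), (False, "N"), (False, "N"), (False, "B"), (False, "P")]

as_positive = sensitivity_specificity(resolve(data, "P"))
as_negative = sensitivity_specificity(resolve(data, "N"))
excluded = sensitivity_specificity(resolve(data, None))
```

In this hypothetical set, counting borderline calls as positive raises sensitivity and lowers specificity, counting them as negative does the opposite, and excluding them gives values in between.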

Tab. 2: Number of substances, individual results, and the range of test results used for determining the pooled standard deviation around the classification threshold T in the prediction model of the considered testing methods

Tab. 3: Decision rule for concluding on the overall test result of LuSens from two consecutive concentrations in a run
N: Negative test result, indicating that a substance has no keratinocyte activating potential. P: Positive test result, indicating that a substance has keratinocyte activating potential. B: Substances with test results within the BR.

Tab. 4: Decision rules for LuSens and the h-CLAT to conclude on the overall test result from not imply a defined order of results.
N: Negative test result, i.e. a substance does not have a keratinocyte activating potential (LuSens) or a dendritic cell activating potential (h-CLAT); P: Positive test result, i.e. a substance has a keratinocyte activating potential (LuSens) or a dendritic cell activating potential (h-CLAT); B: Substances for which test results fall within the BR determined for either LuSens or the h-CLAT.

Tab. 5: Decision rules to conclude on the overall result using the '2-out-of-3' ITS when considering borderline substances in individual non-animal testing methods
Dichotomised result from non-animal testing methods A/B/C does not imply a certain sequential order of testing in the '2-out-of-3' ITS. N: Negative test result, i.e. a substance does not have a peptide reactivity potential (DPRA), a keratinocyte activating potential (LuSens), or a dendritic cell activating potential (h-CLAT); P: Positive test result, i.e. a substance has a peptide reactivity potential (DPRA), a keratinocyte activating potential (LuSens), or a dendritic cell activating potential (h-CLAT); B: Substances for which test results fall within the BR of the DPRA, LuSens or the h-CLAT prediction model.

Figure 1: Number of substances considered borderline and their potency classes (non-sensitiser, weak, moderate, strong) according to LLNA results for different BRs in the DPRA prediction model. (X-axis: BRs considered for the DPRA [MPD in %]; Y-axis: Number of borderline substances).

Tab. 1: Notation for calculating the pooled standard deviation of experimental results per substance and concentration (dose in case of the LLNA)
Notation: Explanation
T: Classification threshold in a testing method's prediction model
i: Substance
j: Concentration tested per substance i
Number of replicates per substance i and concentration j
l: Replicate per substance i and concentration j
Test result of substance i, concentration j and replicate l
Arithmetic mean of test results for substance i and concentration j

T Range of test results around T used for calculating the pooled standard deviation Number of individual results from testing n
T: Classification threshold; SI: Stimulation index; FI: Fold induction; CD54 FI, CD86 FI: Fold induction of cell surface marker expression; MPD: Mean peptide depletion; see Appendix 1 for a list of substances in the substance sets used for quantifying the BR.

Tab. 7: Number and percentage of borderline substances in the experimental sets tested with the DPRA, LuSens, the h-CLAT and the LLNA
Column headers: Testing method; Number of substances in the set; Borderline range (BR)
SI: Stimulation index; FI: Fold induction; CD54 FI, CD86 FI: Cell surface marker expressions; MPD: Mean peptide depletion rate.