LLNA Variability: An Essential Ingredient for a Comprehensive Assessment of Non-Animal Skin Sensitization Test Methods and Strategies

The development of non-animal skin sensitization test methods and strategies is quickly progressing. Whether the methods are used individually or in combination, their predictive capacity is usually described in comparison to local lymph node assay (LLNA) results. In this process, an important lesson from other endpoints such as skin or eye irritation has not yet been fully acknowledged: the variability of the reference test results (here the LLNA) must be accounted for. In order to provide assessors as well as method and strategy developers with appropriate estimates, the variability of EC3 values from repeated substance testing in the LLNA was investigated using the publicly available NICEATM (NTP Interagency Center for the Evaluation of Alternative Toxicological Methods) LLNA database. Repeat experiments taking the vehicle into account (76 substances) or combining data over different vehicles (38 substances) were analyzed. In general, variability was higher when different vehicles were used. In terms of skin sensitization potential, i.e., discriminating sensitizers from non-sensitizers, the false positive rate ranged from 14% to 20%, while the false negative rate was 4-5%. In terms of skin sensitization potency, the rate of assigning a substance to the next higher or next lower potency class was approx. 10-15% each. In addition, general estimates for EC3 variability are provided that can be used for modelling purposes. This analysis stresses the importance of considering the LLNA variability in the assessment of skin sensitization test methods and strategies and provides estimates thereof.


Introduction
Along with the advances in the life sciences, new testing and non-testing methods for improved, more efficient and animal-free assessment of toxicological hazards are being developed at an increasing rate. Once such a method or strategy has reached a certain level of standardization, it is often evaluated to demonstrate its predictive performance, usually by comparing it to the currently regulated hazard assessment test method it aims to complement or ultimately to replace. In many cases the regulated approach is an animal experiment, so that a dedicated study directly comparing both methods in parallel is not possible due to ethical concerns. In general, this problem is solved by comparing data from the new method/strategy with existing data from the routine methods for the same set of substances. While data quality and reproducibility aspects are controlled and systematically assessed for the new approach, the same rigor cannot be applied to existing data of the routine method. Disregarding these aspects for the routine test methods inevitably results in an overestimation of their predictive performance, which consequently results in unrealistically high expectations for the predictive capacity of the new test method/testing strategy. In order to at least partially compensate for this, traditionally used animal experiments have been thoroughly investigated by deriving estimates of variability. This has, for example, supported the regulatory acceptance of in vitro test methods for the human health effects skin irritation and eye irritation/corrosion (Hoffmann et al., 2005; Adriaens et al., 2014).
Various test methods that address the human health endpoint skin sensitization are being developed and many of these have been evaluated systematically by Reisinger et al. (2014). This development was spurred by European regulatory requirements: the marketing ban on cosmetics as well as the REACH regulation on chemicals demand or strongly call for skin sensitization assessment of substances without the use of animals (EU, 2006, 2009). The predictive capacity of individual skin sensitization test methods and testing strategies is primarily assessed by comparison with LLNA data (see e.g., Jaworska et al., 2013; Tsujita-Inoue et al., 2014). Furthermore, there have been attempts to circumvent the sub-optimal comparison with animal data by comparison with human data, which has been fueled by a compilation and categorization of human data proposed by Basketter et al. (2014). These comparisons have considered the reference data of the LLNA (and the human data) in a deterministic manner, i.e., without accounting for the aspect of variability, as for example pointed out by Urbisch et al. (2015). Nevertheless, these efforts have culminated in the acceptance of two OECD test guidelines of individual test methods (OECD, 2015a,b), while guidance on testing strategies is being developed.
With the aim to support assessors as well as method and strategy developers of both non-animal methods and testing assessment strategies for skin sensitization hazard and potency of substances, estimates of LLNA EC3 variability from repeat testing data of 68 substances were derived. Repeat tests that used the same vehicle and repeat tests that used different vehicles were analyzed separately. For this purpose, the LLNA data from the publicly available NICEATM (NTP Interagency Center for the Evaluation of Alternative Toxicological Methods) LLNA database were used. The impact of the variability on LLNA potency classes was analyzed.
The results stress the importance of accounting for LLNA variability in the assessment of skin sensitization test methods/strategies and provide estimates thereof.

Material and methods
The publicly available NICEATM LLNA data compilation, version of December 23, 2013, was the sole data source. In total, it reports results of 1060 experiments using 35 different vehicles for 677 different substances and formulations, specifying the vehicle used to apply the substance and the EC3 value in %, i.e., the estimated concentration that induces a three-fold stimulation index as compared to the respective vehicle. Experiments not inducing three-fold stimulation were considered non-sensitizers and were reported as "NC", i.e., non-classified, while for experiments with insufficient dose-response data for the calculation of EC3 (i.e., non-monotonic) the EC3 value was reported as "IDR", i.e., insufficient dose-response. The data were considered to be sufficiently curated for the purpose of this evaluation and were not verified against the primary sources. After exclusion of "IDR" experiments, LLNA experiments listing the same CAS number and name or synonyms were identified (a total of 454 experiments for 72 substances/mixtures) and the respective EC3 values were grouped for analysis, once for substances with repeat experiments using the same vehicle ("same-vehicle" approach) and once for substances with repeat experiments using different vehicles ("different-vehicle" approach).
For both approaches, median EC3 values were calculated from all repeat experiments of a given substance. The median is a location measure that is robust against aberrant values and that could also be derived for substances with both EC3 and "NC" results. This median was used to assign each substance to one of five potency classes: extreme: median < 0.1%; strong: 0.1% ≤ median < 1.0%; moderate: 1.0% ≤ median < 10.0%; weak: 10.0% ≤ median ≤ 100%; non-sensitizer: NC (ECETOC, 2003). In case of two repeat experiments with one EC3 value and one "NC" result, the substance was conservatively assigned to the class that corresponded to the EC3 value.
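The class boundaries and the median rule described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the analysis: the names `potency_class` and `median_class` are hypothetical, "NC" is encoded as infinity by assumption, and the handling of even-sized samples whose middle pair mixes an EC3 value with "NC" (beyond the two-repeat case stated in the text) is an assumption as well.

```python
import math
import statistics

NC = math.inf  # assumption: an "NC" result is encoded as infinity


def potency_class(ec3):
    """Map an EC3 value in % (or NC) to one of the five potency classes."""
    if ec3 == NC:
        return "non-sensitizer"
    if ec3 < 0.1:
        return "extreme"
    if ec3 < 1.0:
        return "strong"
    if ec3 < 10.0:
        return "moderate"
    return "weak"  # 10% <= EC3 <= 100%


def median_class(ec3_values):
    """Assign a substance to a potency class from its repeat experiments."""
    finite = [v for v in ec3_values if v != NC]
    # Rule stated in the text: two repeats, one EC3 and one NC -> use the EC3.
    if len(ec3_values) == 2 and len(finite) == 1:
        return potency_class(finite[0])
    # Assumption: otherwise NC sorts above every EC3, so a median that
    # involves an NC value is itself treated as NC.
    return potency_class(statistics.median(ec3_values))


print(median_class([0.5, 2.0, 30.0]))  # median 2.0 -> "moderate"
print(median_class([5.0, NC]))         # conservative rule -> "moderate"
```

Encoding "NC" as infinity keeps the median well defined for substances with mixed EC3 and "NC" results, which is the property the text relies on when it chooses the median as location measure.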
Substances with medians in the same potency class were grouped for variability analysis. The impact of variability of EC3 of repeat experiments on potency class assignment was analyzed by determining per group the proportion of all individual EC3 data that would result in a different potency class than the median EC3.
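As a hedged illustration of this per-group proportion, the following Python sketch counts how many individual experiments fall into a less or more severe class than the group's median class. The function name `misclassification_rates` and the example data are made up for illustration; they are not taken from the NICEATM database.

```python
# Potency classes ordered from least to most severe.
ORDER = ["non-sensitizer", "weak", "moderate", "strong", "extreme"]
RANK = {name: i for i, name in enumerate(ORDER)}


def misclassification_rates(median_class, experiment_classes):
    """Return (less_severe, more_severe) proportions relative to the median class."""
    ref = RANK[median_class]
    n = len(experiment_classes)
    less = sum(RANK[c] < ref for c in experiment_classes)
    more = sum(RANK[c] > ref for c in experiment_classes)
    return less / n, more / n


# Hypothetical pooled experiments of all substances with a "moderate" median.
example = ["moderate"] * 7 + ["weak"] * 2 + ["strong"]
less, more = misclassification_rates("moderate", example)
print(less, more)  # 0.2 0.1
```

Pooling the individual experiments of all substances in a group, rather than averaging per-substance rates, follows the wording "proportion of all individual EC3 data"; this reading is an interpretation of the text.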
In addition, variability was described for each substance by the standard deviation (SD) of log-transformed (base 10) EC3 of the repeat experiments. Substances that were non-sensitizing in at least one repeat test were excluded. In addition, substances with only two repeat experiments were excluded, as a sample size of at least three repeats was considered sufficient for an acceptably precise SD estimation. From this set of SD, estimated for both the "same-vehicle" and the "different-vehicle" approach, a general estimate of SD variability was derived.
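The per-substance SD and its exclusion rules can be sketched as follows. The sketch makes several assumptions that the text does not settle: "NC" is encoded as infinity, the sample SD (n - 1 denominator) is used, the general estimate is taken as the median of the per-substance SDs, and the substance names and EC3 values are invented.

```python
import math
import statistics


def log10_sd(ec3_values, min_repeats=3):
    """SD of log10(EC3) for one substance, or None if the substance is excluded."""
    if len(ec3_values) < min_repeats:
        return None  # fewer than three repeats
    if any(math.isinf(v) for v in ec3_values):
        return None  # at least one "NC" result (encoded as infinity)
    logs = [math.log10(v) for v in ec3_values]
    return statistics.stdev(logs)  # assumption: sample SD


# Hypothetical repeat data (EC3 in %); not taken from the NICEATM database.
repeats = {
    "substance A": [1.0, 10.0, 100.0],
    "substance B": [2.0, math.inf, 5.0],  # excluded: contains NC
    "substance C": [3.0, 4.0],            # excluded: only two repeats
}
sds = [sd for sd in map(log10_sd, repeats.values()) if sd is not None]
general_sd = statistics.median(sds)
print(general_sd)  # 1.0
```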

Results and discussion
For the "same-vehicle" approach, filtering of the database resulted in 53 different substances with a total of 76 substance-vehicle combinations (Fig. 1A). For example, repeat experiments were available for seven different vehicles for 1,4-dihydroquinone. In total, 356 LLNA EC3 values were used for calculations. Data were distributed over all five potency classes. The "extreme" class, with 12 substances and 49 experiments, was the least populated (Tab. 1A). Applying the "different-vehicle" approach resulted in 38 substances and a total of 333 experiments for analysis (Tab. 1B). LLNA potency classes were unequally populated, both in regard to the number of substances and the number of experiments. For both approaches, Table 1 summarizes the proportion of more and less severely classified individual experiments for each potency class. For the "same-vehicle" analysis, on average a less severe classification was observed in 9.3% of the cases, while a more severe classification was present in 15.2% of the cases. For the analysis that combined experiments using different vehicles, the misclassification rates were 14.1% and 15.3%, respectively. The majority of misclassifications were one class above or below the median class.
Reducing the potency classes to dichotomous hazard classes of "NS" and "S", i.e., the classes "moderate" to "extreme", resulted in an overprediction proportion (NS as S) of 14.1% and an underprediction proportion (S as NS) of 3.8% (11/292) for the "same-vehicle" approach and in 19.7% and 5.1% (14/272), respectively, for the "different-vehicle" approach.
Repeat experiments also provided the means to generalize LLNA variability. To increase the robustness of this approach, substances with two repeat experiments were excluded. In addition, substances with at least one NC result, which may simply be explained by different test concentration ranges, were disregarded. It needs to be noted that this approach resulted in exclusion of some of the most variable cases, potentially resulting in a systematic underestimation of variability. In this regard, it represented a conservative approach and EC3 variability of repeat experiments is likely higher.
For the "same-vehicle" approach 27 substances were considered. SD values ranged from 0.137 to 1.048 with a median SD of 0.252. For the "different-vehicle" approach 11 substances were included. Their SD values ranged from 0.164 to 0.691, while the median was 0.312. Assuming that log-transformed EC3 are approximately normally distributed, the median SD values can, for example, be used to calculate the most likely probability distribution, confidence intervals and probabilities for over- and under-classification for any given EC3. Consider the example that an LLNA test of a substance with unknown sensitization potential resulted in an EC3 point estimate of 20%, which would trigger a classification as "weak sensitizer". An approximate 95%-confidence interval (CI) can be calculated in a simple manner by adding and subtracting 2 * SD from the log-transformed EC3 value (log(20) - 2 * SD = 0.797; log(20) + 2 * SD = 1.805, using the "same-vehicle" median SD of 0.252). Retransformation results in a 95%-CI for the EC3 ranging from 6.27% to 63.83%. The likelihood that the substance is a moderate sensitizer, i.e., has an EC3 < 10%, is 11.6%, while the likelihood that it is a non-sensitizer is as low as 0.3%. Calculating the same example with the "different-vehicle" approach median SD of 0.312 results in a likelihood of 16.7% for "moderate" and of 1.3% for "non-sensitizer".
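The worked example can be retraced with a short Python sketch under the stated normality assumption. The function name `ec3_uncertainty` is hypothetical, and reading "non-sensitizer" as a true EC3 above 100% is an assumption; with that reading the sketch gives values that agree with the figures quoted above after rounding.

```python
import math
from statistics import NormalDist


def ec3_uncertainty(ec3, sd):
    """Approximate 95%-CI and class probabilities for one EC3 value in %.

    Assumes log10(EC3) is normally distributed around the observed value
    with the given SD.
    """
    mu = math.log10(ec3)
    dist = NormalDist(mu, sd)
    ci = (10 ** (mu - 2 * sd), 10 ** (mu + 2 * sd))
    p_below_10 = dist.cdf(math.log10(10.0))         # moderate or more potent
    p_above_100 = 1.0 - dist.cdf(math.log10(100.0))  # assumption: non-sensitizer
    return ci, p_below_10, p_above_100


for label, sd in [("same-vehicle", 0.252), ("different-vehicle", 0.312)]:
    ci, p_mod, p_ns = ec3_uncertainty(20.0, sd)
    print(f"{label}: CI {ci[0]:.2f}-{ci[1]:.2f}%, "
          f"P(EC3<10%) {p_mod:.1%}, P(EC3>100%) {p_ns:.1%}")
```

Note that the probability of EC3 < 10% formally also covers the "strong" and "extreme" classes; for an observed EC3 of 20% their contribution is negligible, so it is reported as the likelihood for "moderate".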
This relatively simple example demonstrates that the information from repeat LLNA experiments can be used to account for LLNA variability in statistical approaches. First of all, it provides an approach for more appropriate assessment of any new skin sensitization testing method and of testing strategies. Instead of comparing with deterministic EC3 values or classification derived from such values, likelihoods of over- and under-classification can be estimated for each substance to derive more realistic estimates of LLNA predictive capacities for any given substance or set of substances. In this way some of the uncertainty associated with LLNA data can be quantified and accounted for, potentially facilitating discussions about the acceptance of new skin sensitization testing approaches.

Tab. 1: Variability of categorizations of repeat testing using A) the same vehicle, based on LLNA EC3, for 76 substances and B) different vehicles for 38 substances, grouped by potency (median classification of repeats used for potency class assignment; NS: non-sensitiser; subs.: substances)