Integrating Non-Animal Test Information into an Adaptive Testing Strategy – Skin Sensitization Proof of Concept Case

Many authors share the opinion that a single test method cannot replace in vivo skin sensitization animal testing; however, it remains open which tests are actually needed. To address this question, several data integration frameworks have been developed. Jowsey et al. (2006) proposed a conceptual scoring system based on Structure Activity Relationships (SAR), penetration, peptide reactivity, and dendritic and T-cell activation

Summary
There is an urgent need to develop data integration and testing strategy frameworks allowing interpretation of results from animal alternative test batteries. To this end, we developed a Bayesian Network Integrated Testing Strategy (BN ITS) with the goal of estimating skin sensitization hazard as a test case of previously developed concepts (Jaworska et al., 2010). The BN ITS combines in silico, in chemico, and in vitro data related to skin penetration, peptide reactivity, and dendritic cell activation, and guides the testing strategy by Value of Information (VoI). The approach offers novel insights into testing strategies: there is no one best testing strategy; rather, the optimal sequence of tests depends on the information at hand and is chemical-specific. Thus, a single generic set of tests as a replacement strategy is unlikely to be most effective. BN ITS offers the possibility of evaluating, before testing is commenced, how much additional data would reduce uncertainty about the information target.


Introduction
There is a pressing need for non-animal test methods, driven by the forthcoming ban on animal testing for cosmetic ingredients in Europe, the large number of tests potentially required to fill in data gaps for the REACH legislation, and animal welfare concerns. Skin sensitization was identified as the hazard endpoint for which most animal tests would need to be conducted and which required a large number of animals (van der Jagt et al., 2004).
The individual steps involved in the skin sensitization process are illustrated in File A (Supplementary Data at www.altexedition.org). In short, the chemical must penetrate the skin to react with endogenous proteins, either directly or after activation through enzymatic or oxidative processes. Next, epidermal Langerhans cells (LC) and immature dendritic cells (DC) take up and process haptenated proteins. LC mature into antigen-presenting cells, which after migration to the lymph nodes present haptenated protein fragments to T-cells.
Many research groups are working on the development of alternative tests for skin sensitization (Vandebriel and van Loveren, 2010; Aeby et al., 2010). As a chemical's reactivity towards proteins is deemed a key determining factor in its ability to act as a skin sensitizer, much research has focused on in chemico measurements of reactivity with model nucleophiles.
Various nucleophiles are used, and experiments rely on direct peptide reactivity measurements (Gerberick et al., 2008), semi-kinetic measurements (Aptula et al., 2006), or more complex kinetics (Aleksic et al., 2009). In addition, the induction of antioxidant response element (ARE)-dependent gene activity in a recombinant cell line (Natsch et al., 2008) can be used to characterize reactivity indirectly. To further elucidate the skin sensitization induction process, various measures of dendritic cell activation are considered. Recent advances in the in vitro generation of immature dendritic cells and the availability of cell line surrogates with various DC-like characteristics have led to the development of in vitro tests based on the measurement of various cell surface markers or secretion of cytokines modulated upon exposure to chemicals (Aeby et al., 2010; Ryan et al., 2005; Lambrechts et al., 2010). Numerous attempts were also made to predict in silico the skin sensitization potential in vivo (Roberts et al., 2007; Patlewicz et al., 2007, 2008; Patlewicz and Worth, 2008; Karlberg et al., 2008).

mechanistic understanding of the underlying steps involved in skin sensitization induction, the availability of several non-animal tests characterizing these steps, as well as the overall importance of skin sensitization in the context of safety evaluation (Basketter and Kimber, 2009). To this end, we developed an ITS for skin sensitization potential with the specific goal of estimating potency in the mouse LLNA (Local Lymph Node Assay).

Dataset
Representative assays for each of the steps in the induction of skin sensitization (File A in supplementary data; www.altex-edition.org), except the T-cell recognition step, for which no assay data are available, were chosen as inputs to the integrated testing strategy. The data set consisted of responses of 142 chemicals in the following tests: epidermal bioavailability data, peptide reactivity assays, dendritic cell activation, and TIMES predictions. The complete data set, together with LLNA data, is available in File B (supplementary data; www.altex-edition.org).

Information target: Murine Local Lymph Node Assay
LLNA data (OECD testing guideline 429, http://www.oecdilibrary.org/content/book/9789264071100-en) were compiled from multiple sources, which included the published literature (Gerberick et al., 2005; Kern et al., 2010) and previously unpublished data from several laboratories. The chemicals were chosen based on quality of LLNA studies and availability of data. The data comprise a variety of chemical classes, including fragrances, preservatives, dyes, dye-precursors, halogenated alkanes, and solvents, and cover a wide range of physico-chemical properties. Out of the 142 chemicals, 37 were non-sensitizers and 105 were sensitizers. The four-way classification scheme of non-sensitizing (NS), weak (W), moderate (M), and strong (S, which also included extreme sensitizers) (Kimber et al., 2003) was used to characterize potency in the LLNA.

Epidermal bioavailability
Finite and infinite dose variables were considered. Finite dose-related variables were calculated using a transdermal transport model (Kasting et al., 2008). During simulation of a single exposure, free and total maximum mid-epidermal concentrations, Cfree and Cmax (µmol/cm³), as well as % systemically absorbed, were calculated. The infinite dose-related variables, Kow and Kp (permeability coefficient), were estimated using KOWWIN and EPIWIN (v.1.6.7) software. Bioavailability data were generated for two exposures (V1 = 1 mg/cm², V2 = 10 mg/cm²) that cover a range of chemical exposures relevant to typical consumer products.

Direct Peptide Reactivity Assay (DPRA)
Peptide reactivity data were generated using a method to measure reactivity of a test chemical with model heptapeptides containing lysine (Lys) or cysteine (Cys) (Gerberick et al., 2004). Peptide reactivity is reported as percent depletion based on the decrease in free peptide concentration in the sample.
to obtain a prediction of skin sensitization potential. Following Jowsey et al., Maxwell and Mackay (2008) developed a mechanistic model of skin allergy using a systems biology approach. While the model provided a valuable mechanistic hypothesis, its use in risk assessment is limited by its need for a large amount of experimental data. Natsch et al. (2009) combined two in vitro measurements with in silico predictions into a yes/no classification model. Recently, Nukada et al. (2010) combined data from a dendritic cell activation assay with peptide reactivity data using a rule-based scoring system.
Basketter and Kimber (2009) reviewed the current state of the art of in vitro alternatives for skin sensitization and updated the Jowsey et al. (2006) proposal. In parallel, several groups are pursuing a different route to explain skin sensitization effects in vivo (e.g., Roberts and Patlewicz, 2009). Roberts et al. (2008) explore the concept of a molecular initiating event that is represented by covalent chemical binding with "protein" and focus on interpretation of this step to explain sensitization, considering cell-based assays only for stages downstream of the reactivity step. There is no consensus on the relative merits of the different proposed frameworks.
A data integration framework is a goal in its own right and is useful in risk assessment (Maxwell and Mackay, 2008). In this paper we pursue the closely related, but broader, goal of constructing an Integrated Testing Strategy (ITS). An ITS requires a data integration framework that allows information to be synthesized cumulatively and that guides testing so that the information gain in a testing sequence is maximized. In narrative terms, ITS can be described as combinations of tests in a battery covering relevant mechanistic steps and organized in a logical, hypothesis-driven decision scheme, which is required to make efficient use of generated data and to gain a comprehensive information basis for making decisions regarding hazard or risk (Jaworska and Hoffmann, 2010). The importance of the information target formulation is discussed in Jaworska and Hoffmann (2010). ITS for several human health and environmental safety endpoints were outlined in the REACH Technical Guidance Document (TGD). Grindon et al. (2007) further customized the REACH ITS for skin sensitization potential. An analysis of existing ITS approaches with respect to the objective of optimizing chemical testing can be found in Jaworska et al. (2010). In short, these authors identified the following shortcomings: 1) the use of flow charts as the ITS's underlying structure may lead to inconsistent decisions; 2) there is no underlying methodology to derive consistent and transparent inferences about the information target (e.g., a well-defined toxicological endpoint serving to address hazard), taking into account all available evidence and its interdependence; 3) there is no objective guidance, or only purely expert-driven guidance, regarding the choice of subsequent tests that would maximize information gain in predicting the information target.
The aim of the study was to put the previously developed concepts on data integration and ITS (Jaworska et al., 2010; Jaworska and Hoffmann, 2010) into practice and evaluate their utility in a proof-of-concept case. For this test case we chose skin sensitization potential as the ITS target because of a relatively good

Next, the structure of the BN and the probabilistic relationships between the variables were extracted directly from the data. The network development consisted of the following steps: 1) transforming the training set into discrete variables; 2) latent variable structure learning; 3) missing data imputation; 4) final model structure learning; 5) elucidation of the conditional probabilities, i.e., parameter learning. The network was constructed using the BayesiaLab software (www.bayesia.com).
The objective of BN structure learning is to build a graph representing dependence in the data, achieving the best fit to the data with minimal structural complexity of the network. In BN terminology, the variable for which we develop a hypothesis (LLNA potency, in this study) is the target variable, while the variables providing evidence (all three of the above-listed types of tests with 12 readouts) are referred to as manifest variables. In addition to manifest variables, the latent variables Bioavailability, Dendritic cells, and Reactivity were introduced into the network structure. The latent variables are not observable and are conceptual. They allow the communication of summary results obtained from the network, and they simplify both the structure of the network, by reducing the number of arcs between conditionally dependent variables, and the numerical computation of the joint probabilities.
Removing chemicals with incomplete records would leave only 45 chemicals for which the full record is available. Hence, 97 chemicals would be ignored, and valuable information would be lost. Discarding such a large portion of the data may result in biased estimates, especially when the missing data contain information considerably different from the rest of the data set. Hence, to maximize use of available information, the data gaps in the training set were filled in by imputation. Details of each BN construction step are described in File C (supplementary data; www.altex-edition.org).
The classification performance of the network was evaluated on the test set of 12 chemicals provided in the Supplementary Data. However, we would like

Cell-based ARE assay
ARE data were taken from Natsch et al. (2009). AREc32 is a stable cell line derived from the human MCF7 breast carcinoma cell line (Wang et al., 2006). The average Imax (maximal induction of gene activity reported as fold-induction vs. untreated cells) and the average concentration inducing 1.5-fold enhanced gene activity (EC1.5) are determined. For the analyses in this paper EC1.5 values were used and reported as ARE luciferase (Luc). (Data reported as >1000 are listed as 2000 µM in our data set.)

Dendritic cell activation
The data were generated using the U937 Activation Test, an in vitro cell-based skin sensitization screening test which uses the human myeloid cell line U937 (Python et al., 2007). Cell surface CD86 expression and IL-8 secretion are measured as activation markers.

TIMES
The TIMES software (v.2.25.7) (Dimitrov et al., 2005) was run to predict the skin sensitization potential. Predictions based on the parent molecule (TIMES-P), as well as predictions considering potential skin metabolism (TIMES-M), were investigated in the study.
Out of the 142 × 12 records, 14.2% were missing. Specifically, only 70 chemicals had dendritic cell activation data (i.e., CD86 and IL-8 data); thus 72 × 2/284 = 51% of these records were missing. For Reactivity data, 93 records were missing: 75 for Luc, 8 for Cys, and 10 for Lys. There were no missing records for Bioavailability, as all data were generated in silico. Only 45 out of 142 chemicals had complete data records for all tests. The abbreviated input variable names and their units are presented in Table 1.
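The bookkeeping behind these figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: it tallies the gaps itemized in the text (the variable names are ours) and relates them to the 142 × 12 field count. The itemized gaps come out slightly below the quoted 14.2%, which may reflect gaps that are not itemized in this paragraph.

```python
# Counts taken from the text; dictionary keys are illustrative labels.
n_chemicals, n_readouts = 142, 12
missing = {"CD86": 72, "IL-8": 72, "Luc": 75, "Cys": 8, "Lys": 10}

total_fields = n_chemicals * n_readouts          # all training-set fields
itemized = sum(missing.values())                 # gaps listed in the text
dc_fraction = (missing["CD86"] + missing["IL-8"]) / (2 * n_chemicals)

print(f"fields: {total_fields}, itemized gaps: {itemized}")
print(f"itemized share of all fields: {100 * itemized / total_fields:.1f}%")
print(f"dendritic cell fields missing: {100 * dc_fraction:.0f}%")
```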

Bayesian Network construction
Prior to the network construction, the tests considered as inputs were mapped onto a mechanistic scheme of the skin sensitization induction process as described in Basketter and Kimber (2009) and shown in File A (supplementary data; www.altex-edition.org).

Results

Bayesian Network construction
Input data transformation to discrete values
The histograms representing the training data set, discretized according to the process described in the Supplementary Data, are shown in Figure 1.
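Once cut points are fixed, mapping a continuous readout to a discrete state is mechanical. The sketch below is only an illustration, not the discretization procedure of the Supplementary Data: it applies the Cys and Lys depletion ranges reported later in the Results, and treating each cut point as an inclusive upper bound is our assumption, as are the function and constant names.

```python
from bisect import bisect_left

# Cut points in % peptide depletion, as quoted in the Results section.
# Inclusive upper bounds are an assumption of this sketch.
CYS_CUTS = [21.0, 70.0]   # C1 <= 21, C2 (21, 70], C3 > 70
LYS_CUTS = [13.0, 29.0]   # L1 <= 13, L2 (13, 29], L3 > 29

def discretize(value, cuts, prefix):
    """Return a 1-based state label such as 'C2' for a continuous readout."""
    return f"{prefix}{bisect_left(cuts, value) + 1}"

for depletion in (15.0, 45.0, 95.0):
    print(depletion, discretize(depletion, CYS_CUTS, "C"))
```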

Elucidation of the latent variables
The unsupervised clustering algorithm identified two clusters that were biologically meaningful. The first cluster contained variables associated with Bioavailability (1st latent variable): MW, Epi conc., Cfree, Kp, and Kow. The second cluster contained variables associated with Reactivity (2nd latent variable): Lys, Cys, Luc, and TIMES-M or TIMES-P. For the CD86 and IL-8 variables no meaningful clustering was found. The failure of automatic clustering for these variables is a result of large data gaps in the original data set. Aiming for a mechanistic interpretation of the latent variables, we manually added Dose abs to the Bioavailability cluster and manually created a third cluster from the Dendritic cells data (3rd latent variable): CD86 and IL-8.
The structure learning algorithm identified local networks for each latent variable in the form of Naïve Bayes (Fig. 2). A joint probability distribution was calculated for each cluster to represent a latent variable.

Missing data imputation
Chemicals with no missing data were selected for the Bioavailability and Dendritic cells clusters. For the Reactivity cluster the filtering was done considering only Cys and Lys, and not Luc, because there were no chemicals containing full records. Based on the resulting local data sets, all missing values were filled in using the EM algorithm (Meng and Rubin, 1993), which yielded an amended but complete new training set. The data were then ready for final model structure and parameter learning.

to emphasise that the value of using the network is far more than a prediction framework. The network represents key steps of the skin sensitization process, and it can be queried to find a variety of options to develop a mechanistically interpretable testing strategy. Finding equivalent tests, assessing the value of adding an additional test when a related one is known, and demonstrating the evolution of the testing strategy based on the accumulating evidence are useful features of the network approach.

Methodology to guide testing
Value of Information (VoI) measures and a one-step look-ahead hypothesis were used as the methodology to guide testing. The one-step look-ahead hypothesis calculates the VoI from all possible individual information sources and chooses the one for which the information gain about the target variable is maximized. The foundation of this reasoning is the analysis of the changes in the probability distribution of the information target given a set of existing data versus generation of new data. In this study, we use relative mutual information MI(X, Y) between variables X and Y to measure VoI. MI measures the amount of uncertainty in Y that is removed by knowing X. This corresponds to the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing X provides about Y. "Relative" refers to the percentage of the entropy of the parent node Y, H(Y), that is removed by knowledge of X. Thus, relative MI amounts to MI(X, Y)/H(Y) and is expressed in %. In the remainder of the paper relative MI is abbreviated as MI. For more technical information, see File C (supplementary data; www.altex-edition.org).
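The relative MI measure, MI(X, Y)/H(Y) expressed in %, and the one-step look-ahead choice can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the BayesiaLab implementation: the joint tables are hypothetical, the function names are ours, and in an actual network each joint table would be conditioned on the evidence already entered.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * log2(q) for q in p if q > 0)

def relative_mi(joint):
    """Relative mutual information MI(X, Y) / H(Y), in percent.

    joint[i][j] = P(X = x_i, Y = y_j): rows are test outcomes,
    columns are states of the target variable Y.
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )
    return 100.0 * mi / entropy(py)

def best_next_test(joints):
    """One-step look-ahead: the candidate test with the highest relative MI."""
    return max(joints, key=lambda name: relative_mi(joints[name]))

# Hypothetical joint tables for a binary target (not data from the study).
candidates = {
    "test_A": [[0.40, 0.10], [0.10, 0.40]],  # informative about the target
    "test_B": [[0.25, 0.25], [0.25, 0.25]],  # independent of the target
}
for name, table in candidates.items():
    print(name, round(relative_mi(table), 1))
print("next test:", best_next_test(candidates))
```

After the chosen test is run, its result would be entered as evidence, the joint tables recomputed, and the selection repeated.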
The final Bayesian Network structure
The network is able to follow the skin sensitization process by choosing a test sequence representing individual steps in the process. The BN ITS structure represents a Hierarchical Naïve Bayes classifier (Langseth and Nielsen, 2006), except that the TIMES node is connected both to the Reactivity cluster and directly to the hypothesis variable LLNA (Fig. 3). It represents an advancement over the popular Naïve Bayes (NB) classifier, which assumes independence among manifest variables and thereby ignores dependence between tests, which translates to information duplication. HBN models have been shown to improve classification accuracy over NB by introducing latent variables to account for conditional dependence between manifest variables (Langseth and Nielsen, 2006), as well as for data heterogeneity between the clusters representing latent variables (Demichelis et al., 2006).
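The information duplication that arises under the Naïve Bayes independence assumption can be seen in a toy calculation. The sketch below uses hypothetical probabilities and an extreme case of our own choosing, a second test that is a perfect copy of the first; it is not the network from this study.

```python
def posterior(prior, likelihoods):
    """Bayes rule with independent likelihood factors, normalized to sum to 1."""
    unnormalized = dict(prior)
    for likelihood in likelihoods:
        for state in unnormalized:
            unnormalized[state] *= likelihood[state]
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}

prior = {"sensitizer": 0.5, "non-sensitizer": 0.5}
# Hypothetical P(positive result | state) for one reactivity test.
positive = {"sensitizer": 0.8, "non-sensitizer": 0.2}

# The second test is assumed to be a perfect copy of the first, so it adds
# no information; Naive Bayes nevertheless counts it as fresh evidence.
one_test = posterior(prior, [positive])
naive_two_tests = posterior(prior, [positive, positive])

print(f"one test:        {one_test['sensitizer']:.3f}")
print(f"NB with its copy: {naive_two_tests['sensitizer']:.3f}")
```

Routing correlated tests through a shared latent variable, as in the hierarchical structure, is one way to avoid this overconfidence.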

Value of Information analysis
Test ranking based on Mutual Information (MI) with LLNA
The global ranking ordered all tests regardless of the LLNA potency group or state. The local ranking ranked the tests separately for each possible LLNA state and can be used to advise on the next test to further refine a particular hypothesis, for example, that a chemical is a weak sensitizer. We can evaluate which test is the most informative globally as a starting point and afterwards refine the hypothesis as suggested by the local ranking.
From the global view, the Reactivity latent variable was the most informative and contained more information explaining LLNA activity than the Bioavailability and Dendritic cells variables together. At the level of latent variables, we observed that inclusion of the TIMES-M model in the network improved the mutual information of the Reactivity latent variable by 35%. The presence of the TIMES-M model corrected the joint probability distribution of the Reactivity latent variable so that it better predicts LLNA potency in the global and per-LLNA-state analyses (except for weak sensitizers). From the local ranking we observed different patterns of importance for different potency classes (Tab. 2). Interestingly, Bioavailability profoundly dominated the ability to explain weak sensitizers. Dendritic cells were always the least informative latent variable, except in the case of strong sensitizers, for which they were the second most informative latent variable. We emphasize that results for Dendritic cells may be biased as a result of so many data gaps and need confirmation with more data.
Next, we analyzed the MIs of manifest variables with LLNA both globally and per state (Tab. 3). Based on MI values, all types of variables carried more VoI for the NS class than for other classes. We studied the rankings with and without TIMES-M in the network. Inclusion of TIMES-M increased the MIs between all manifest variables belonging to Reactivity in the network, meaning that the TIMES-M model corrected the joint probability distribution of Cys, Lys, and Luc. The MI index and high rankings of TIMES-M compared to other Reactivity variables are inflated because of the 72% overlap between chemicals in the TIMES training set and the training set used in this study. In other words, TIMES had already "seen" 72% of the LLNA data and learned rules from these data. In contrast, the experimental methods are entirely unbiased. As a consequence, a comparison of the VoI carried by TIMES with the VoI of experimental data based only on these values is not fair. To address this, we investigated the performance of variants of the network with and without TIMES later in the paper.
Among the Bioavailability manifest variables, the most informative was Cfree, followed by Dose abs. However, due to the empirical formulation of partitioning in the transdermal transport model equations (Kasting et al., 2008), it is premature to identify Cfree as the key bioavailability-related driver for skin sensitization. Nevertheless, both Cfree and Dose abs carried more information than Kow and MW and demonstrated the value of including finite dose exposure calculations in BN ITS for LLNA potency assessment.

Tests may carry equivalent information towards explaining the target
MI can be used to determine whether two different tests carry equivalent information towards explaining the target and whether there is added value in conducting a second test when the result of one is already available. We illustrated this with the peptide reactivity tests. Cys and Luc had very similar MI with LLNA both globally and per state. This means that, based on the available evidence, both tests can be used interchangeably to learn about LLNA potency. In addition, generating evidence on both tests did not advance our knowledge about LLNA potency, e.g., MI(LLNA, Cys or Luc) = MI(LLNA, Cys and Luc) (data not shown). Given that this conclusion was reached with many data gaps for Luc, confirmation with more data is needed.

Value of adding Lys test if Cys result is known
In a BN one can study not only the value of adding additional evidence, as discussed above, but also how an additional test result changes the hypothesis about the target distribution by directly observing changes in the posterior distribution. We illustrated this by studying changes in the LLNA posterior distribution after combining evidence from Cys reactivity with Lys reactivity. First, the impact of providing the network with Cys results was examined.
According to the discretization (Fig. 2), three ranges, i.e., C1 ≤21%, C2 [21-70]%, and C3 >71% depletion in the reactivity assay, covered all possible Cys results. If result C1 was obtained, the chance that the evaluated chemical is a non-sensitizer increased from 26% to 56% (Fig. 4), while the chance of it being a moderate sensitizer decreased to 15% and of it being a strong sensitizer to 7%. Note that the chance of obtaining a Lys depletion ≤13% increased to 94%, indicating a strong dependence between the C1 and L1 results. Subsequently, information on Lys was added. Three simulations were carried out, one for each possible state, to study differences in the resulting distributions for LLNA (Fig. 5). If Lys was L1 (≤13%), then very little further refinement was obtained regarding the LLNA potency: the probability of the hypothesis that a chemical is a non-sensitizer changed from 56% to 59%. This was a consequence of the high conditional dependence between Cys=C1 and Lys=L1, Pr(L1|C1)=0.93. However, if the result for Lys was L2 (13-29% depletion) or L3 (>29% depletion), the LLNA distributions shifted from non-sensitizer-centered towards moderate (35% for both L2 and L3) or strong sensitizer (26% for L2 and 27% for L3).

iterative and that different parts of the network can be interrogated on the fly, as the whole network will update itself. By this, we mean that the probability distributions for all the nodes of the network, not only the target node, are updated.
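How evidence on one test changes the belief about a sibling test, and not only about the target, can be illustrated with a two-state latent node that has two child tests. The numbers below are hypothetical and chosen only for illustration (they do not reproduce the probabilities reported in this study), and the names are ours.

```python
# Hypothetical numbers for illustration; not taken from the study.
P_REACTIVITY = {"low": 0.4, "high": 0.6}    # prior over the latent node
P_C1_GIVEN_R = {"low": 0.9, "high": 0.1}    # P(Cys = C1 | Reactivity)
P_L1_GIVEN_R = {"low": 0.95, "high": 0.5}   # P(Lys = L1 | Reactivity)

def reactivity_given_c1():
    """Belief about the latent node after observing Cys = C1 (Bayes rule)."""
    joint = {r: P_REACTIVITY[r] * P_C1_GIVEN_R[r] for r in P_REACTIVITY}
    total = sum(joint.values())
    return {r: p / total for r, p in joint.items()}

def prob_l1(belief):
    """P(Lys = L1) under a given belief about the latent node."""
    return sum(belief[r] * P_L1_GIVEN_R[r] for r in belief)

before = prob_l1(P_REACTIVITY)
after = prob_l1(reactivity_given_c1())
print(f"P(Lys = L1) before Cys evidence: {before:.2f}")
print(f"P(Lys = L1) after Cys = C1:      {after:.2f}")
```

When the updated probability of a not-yet-run test outcome is already close to 1, running that test is expected to add little, which is the qualitative pattern described for Cys and Lys.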

Toxicity signatures
We generated toxicity signatures for each LLNA state. A toxicity signature is a fingerprint consisting of manifest variables with values that maximize the probability of a particular LLNA state (Tab. 4).
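One simple way to read this definition is to pick, for each manifest variable separately, the value that maximizes the posterior probability of the state of interest. The sketch below follows that reading; the per-variable treatment, the two-state target, and all probabilities are our assumptions for illustration, not the procedure or data behind Tab. 4.

```python
def signature(prior, cpts, state):
    """For each variable, the value v that maximizes P(state | variable = v).

    cpts[var][s][v] = P(variable = v | LLNA = s).
    """
    result = {}
    for var, cpt in cpts.items():
        def posterior_of_state(v, cpt=cpt):
            evidence = sum(prior[s] * cpt[s][v] for s in prior)
            return prior[state] * cpt[state][v] / evidence
        result[var] = max(list(cpt[state]), key=posterior_of_state)
    return result

# Hypothetical two-state target and conditional probability tables.
prior = {"NS": 0.3, "S": 0.7}
cpts = {
    "Cys": {"NS": {"C1": 0.80, "C2": 0.15, "C3": 0.05},
            "S":  {"C1": 0.20, "C2": 0.30, "C3": 0.50}},
    "Lys": {"NS": {"L1": 0.90, "L2": 0.07, "L3": 0.03},
            "S":  {"L1": 0.50, "L2": 0.25, "L3": 0.25}},
}
print(signature(prior, cpts, "NS"))
print(signature(prior, cpts, "S"))
```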

BN ITS classification performance
BN ITS classification performance was assessed via the AUC of ROC curves (Area Under the Curve of the Receiver Operating Characteristic) (Tab. 5). Because we have a four-way classification model, AUC indices larger than 25% are better than a random guess. Three variants of the BN ITS were examined: a) with TIMES-P; b) with TIMES-M; and c) without TIMES. These comparisons were completed for two exposures: V1 and V2. In all cases, while the overall structure of the network was the same, changes in performance were noted. The networks with and without TIMES performed similarly well when predicting the NS class. The network without TIMES, however, performed worse for the W, M, and S classes. This suggests that TIMES and experimental reactivity data are about equivalent for predicting the NS class. This also suggests that TIMES is a very valuable component of ITS for predictions of

If the result for Cys was larger than 21%, i.e., either the C2 or C3 state was observed, the results were similar to those for C1. The C2 result alone yielded NS=3%, W=22%, M=42%, S=34%, while the C3 result alone yielded NS=4%, W=22%, M=41%, S=33%. Further, changes in the LLNA distribution after evidence on Lys was added were smaller than 2% for all states. The above analyses suggest no value in conducting the Lys test if Cys reactivity is greater than 21%. However, there is value in conducting the Lys test when Cys reactivity is smaller than 21%. Out of 142 chemicals in the training set, 57 have Cys reactivity values <21%, the majority being non-sensitizers.

Testing strategy depends on the initial information and changes based on incoming new information in an adaptive manner
Frequently a full record for the assessed chemical is not available. In a BN setting, an initial hypothesis can always be generated based on incomplete evidence. Testing should start with the test having the highest MI with the target among all available tests. After obtaining the result from one test (or several, if one so chooses), the hypothesis about the target can be revised and the calculation of MI repeated. Figures 6, 7, and 8 illustrate a sequential testing strategy for 2,5-toluenediamine sulfate (PTD) guided by MI. These figures are integral to understanding how the network works and its potential to alleviate unnecessary animal testing. They aim to show that the process is iterative

3 we see that Epi conc. will have no influence on the LLNA posterior. Thus we stop and conclude that the chemical is either a moderate (48%) or a strong sensitizer (50%). To further refine this hypothesis, evidence other than the tests considered in this ITS has to be provided.

between non-sensitizers and sensitizers suffices. Our network can easily be adapted for this purpose by pooling weak, moderate, and strong into one class: sensitizers. Pooling at the decision-making stage rather than at the modeling stage allows for a more precise identification of non-sensitizers.
Results suggested that TIMES-M performs slightly better in this setup (Tab. 6).
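Pooling at the decision-making stage amounts to summing the posterior probabilities of the W, M, and S states after inference. The sketch below is a minimal illustration with a function name of our choosing; the input is the posterior quoted in the text for a Cys result in state C2, in percent, which sums to 101 because of rounding in the source.

```python
def pool_sensitizers(posterior):
    """Collapse the W, M and S states into one 'sensitizer' class after inference."""
    return {
        "non-sensitizer": posterior["NS"],
        "sensitizer": posterior["W"] + posterior["M"] + posterior["S"],
    }

# LLNA posterior (%) reported in the text for Cys = C2.
cys_c2 = {"NS": 3, "W": 22, "M": 42, "S": 34}
print(pool_sensitizers(cys_c2))  # {'non-sensitizer': 3, 'sensitizer': 98}
```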

BN ITS performance on the test set
We assessed network performance on a small test set of 12 chemicals. They represent different chemical classes, e.g., haloalkanes, amines, and acids, overlapping with the training set. However, it is a challenging test set to predict, as some of the chemicals are prohaptens (e.g., para-toluenediamine (PTD), aniline) or have unclear reaction mechanisms. For the test chemicals only TIMES, Lys, Cys, and Bioavailability-related inputs were available, except for PTD, for which Dendritic cell results were also available. The performance of the networks with TIMES-M, with TIMES-P, and without TIMES was compared, recognizing that the variability in EC3 is in the range of 0.5 to 2 times (0.5 EC3 to 2 EC3) and that potency class assignment can therefore be off by one class (Tab. 7) due to test variability. Therefore, we considered as correct the exact predictions, defined as a match between

individual W, M, and S classes, for which experimental reactivity data seem less informative. This observation is in line with the excellent 88% correct predictions for W and M+S by TIMES in this study. The analyses of experimental reactivity data by Gerberick et al. (2007) were restricted to binary classifications (Se=88%, Sp=90% on the training set), so their performance in predicting the individual W, M, and S classes separately was not available.

Different TIMES reference predictions: TIMES-M and TIMES-P
Both TIMES-M and TIMES-P showed a strong connection with the Reactivity cluster in addition to the LLNA node. In the V1 version, the MI between TIMES-M and the Reactivity latent variable was 0.42, and for the TIMES-P model it was 0.29. The stronger link for TIMES-M likely reflects the fact that TIMES-M assesses skin sensitization potency based on the most potent molecule among the parent chemical and its metabolites, and that reactivity is positively correlated with potency.
The network with TIMES-M gave a better fit for weak and moderate sensitizers than the network with TIMES-P. Both networks performed similarly for moderate and strong sensitizers. Consideration of metabolism thus appears to be important for the non-sensitizer and weak classes. Further, the network performance with both TIMES predictions was worse for the strong and moderate classes than for non-sensitizers and weak sensitizers. This result can be partially explained by the fact that TIMES predicts only three states (none, weak, and moderate/strong/extreme together), while our network has four states for LLNA potency; thus moderate and strong sensitizers are predicted less precisely.
For some chemical management decisions, for example Regulation eC No 1272/2008 as required for ReACH, distinction be-  Tab.4: Toxicity signatures based on the manifest variables for each of the LLNA states Units for the variables are provided in Table 1.These numbers should be treated in a qualitative manner as they reflect data in the training set only.

Discussion
This study presented an attempt to construct and test an integrated testing strategy following up on the concepts laid out in earlier works. The goals of the study were achieved and the BN ITS for LLNA potency was constructed despite a challenging data set with many data gaps. The developed BN ITS combined prior biological knowledge with heterogeneous experimental in silico, in chemico and in vitro evidence and generated a probabilistic hypothesis about the potency of a chemical in the LLNA assay. It can be used purely for data integration and combined inference, as well as a tool for guiding an adaptive testing strategy. Bessems (2009) and many other authors acknowledge the limitations of alternative assays as replacements for an in vivo study and recommend shifting the focus towards reduction and refinement. The BN ITS framework can be viewed as a reduction strategy, as chemicals with clear potency can be separated from chemicals for which more evidence needs to be generated. The approach resembles current trends in clinical trial design that strive to optimize efficiency and increasingly rely on adaptive Bayesian design (Berry et al., 2010). The BN ITS for LLNA potency provided better predictions than earlier approaches assessing LLNA and at the same time offered new insights into testing strategies. The framework formulated a flexible, adaptive testing strategy. It offers objective guidance on how to identify situations in which generating additional data would not reduce uncertainty about the target. The results showed that there is no one best test sequence, but rather that the testing strategy depends on chemical structure, exposure, and initial information. Value of Information analysis demonstrated that differences in VoI rankings depend on the potency of a chemical. The section Testing strategy depends on the initial information and changes based on incoming new information in an adaptive manner under 3.2 and Table 7 further demonstrate that the BN ITS 
not experimental and the class prediction with the highest probability, as well as a number of predictions off by one class. The network with TIMES-P performed similarly to the network with TIMES-M. It predicted six exact matches and five one-class-off matches (92% total correct), while the network with TIMES-M had five exact predictions and five one-class-off predictions (83% total correct). The network without TIMES had three exact matches, six one-class-off predictions and two two-classes-off predictions (75% total correct). A more detailed analysis revealed differences in LLNA posterior probability distributions between the TIMES-P and TIMES-M networks for the chemicals for which the TIMES-P and TIMES-M class predictions were the same (Tab. 7). All test set molecules were part of the TIMES training set. The prediction of potencies of prohaptens, i.e., aniline, PTD, and pyridine, improved with TIMES-M. This is clearest for the potent prohapten sensitizer PTD, for which prediction without TIMES did not result in a clear identification as a potent sensitizer. For weak sensitizers (aniline, pyridine), consideration of metabolism was less impactful, as expected.
Most LLNA predictions (e.g., dimethyl sulfate) have a unimodal pattern, whereas some (e.g., benzoic acid) reveal a bimodal pattern. Bimodality is a sign that the chemical input data reveal a pattern unseen in the training dataset, and that the chemical is thus likely to be outside the model domain. The other possibility is that the input data are in conflict with each other due to some error. The potential errors can be experimental errors or prediction errors for in silico tests. For benzoic acid, both TIMES and reactivity data suggested that the chemical is a non-sensitizer, but bioavailability data were in conflict, providing strong evidence for the strong class. In this case, bioavailability data were unreliable because the epidermal bioavailability model (Kasting et al., 2008) is not suitable for acids. In general, data conflict can be used for the purpose of quality assurance of the overall prediction. The ITS framework cannot suggest tests that are not a part of the network. Such problems can only be solved with approaches such as that of Maxwell and Mackay (2008). In addition, the BN ITS in its current form, by virtue of the underlying input information, only considers metabolism through TIMES-M.
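The scoring scheme used above (exact matches plus predictions off by one potency class, counted as correct out of the 12 test chemicals) can be sketched as follows. The class labels follow the NS/W/M/S notation used in the text; the function names are illustrative, and only the match counts quoted above are taken from the study.

```python
CLASSES = ["NS", "W", "M", "S"]  # ordered LLNA potency classes

def class_distance(predicted, observed):
    """Number of potency classes separating a prediction from the observation."""
    return abs(CLASSES.index(predicted) - CLASSES.index(observed))

def percent_correct(exact, one_off, total):
    """Share of predictions that are exact or off by one class, in percent."""
    return round(100 * (exact + one_off) / total)

# Counts reported in the text for the 12 test chemicals.
times_p = percent_correct(exact=6, one_off=5, total=12)
times_m = percent_correct(exact=5, one_off=5, total=12)
no_times = percent_correct(exact=3, one_off=6, total=12)
```

With the reported counts this reproduces the quoted totals of 92%, 83%, and 75%.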
In the VoI analysis we examined the MI rankings for latent and manifest variables to compare their relative importance in explaining LLNA potency. As postulated in previous studies (Roberts et al., 2007), reactivity characterization is very important in predicting sensitization. The MI rankings not only confirmed this postulate but also quantified the importance. However, bioavailability, and not reactivity, appeared as the major driver determining that a chemical is a weak sensitizer. This exception for chemicals with Kow >3.9 and MW >185 Da suggested poor dermal penetration irrespective of reactivity profile. Our results regarding the lack of dendritic cell importance came as a surprise. In other studies, dendritic cell data correlated well with LLNA data (Nukada et al., 2010; Lambrechts et al., 2010). These results should, therefore, be interpreted with caution, especially since 50% of records were missing dendritic cell data. Among manifest variables, TIMES-M was the most informative. This result is biased because TIMES was already trained on a part of the ITS training set. TIMES was followed by the Cys and Luc tests, which carried similar VoI with respect to LLNA. As noted earlier, all variables carried more VoI for the NS class than for the sensitizing classes. This is not surprising, as it shows that the network is best suited to discriminate NS from sensitizers, which is a biological distinction. The split between W, M, and S is based on an arbitrary cutoff applied to the LLNA data. It is more difficult for a model to separate different potency classes than to discriminate non-sensitizers from sensitizers. However, while separating NS from sensitizers is often sufficient for hazard identification within regulatory requirements, the discrimination of potency classes remains a very important aspect of model development, as it is critical for conducting skin sensitization risk assessments. These conclusions are limited to the analyzed data set and require further analysis with more data.
Further, we showed how MI can be used in sequential testing to determine when a follow-up test may add value. Thus, Lys data generated under defined conditions add value to existing Cys data. Our results suggested no value in conducting the Lys test if Cys reactivity was greater than 21%, and an important contribution of Lys data in explaining LLNA potency when Cys reactivity was smaller than 21%. These findings confirm the observations of Alvarez-Sanchez et al. (2003) and Eilstein et al. (2006) regarding the importance of considering various amino acid nucleophiles to understand skin sensitization. The importance of studying reactivity towards Lys was already discussed by Gerberick et al. (2004) and further investigated by Troutman et al. (2011), who concluded that Lys was important for molecules with specific reactivity towards NH2 groups (e.g., anhydrides, isocyanates). Since this study evaluated data in the context of potency, further work is needed to link these two different views: potency-oriented and chemistry-oriented results.
only calculates different VoI based on the potency class but it eventually is chemical-specific, as the individual chemical is associated with its unique biological fingerprint, resulting in a particular unique LLNA probability distribution. Therefore, mandating a single, generic set of tests as a replacement strategy is unlikely to be efficient.
The suitability of BNs as the underpinning methodology for ITS development has been discussed previously (Jaworska et al., 2010). Use of a Bayesian network, as a formal framework, provides a basis for consistent and transparent reasoning when integrating different, incomplete, and conflicting data. The network is able to follow the skin sensitization process by choosing a test sequence representing individual steps in the process. Because it is a probabilistic approach, it allows 1) addressing uncertainty in the biological knowledge, 2) combining heterogeneous evidence, and 3) quantifying uncertainty about target and relationships. Uncertainty in relationships is characterized in probabilistic terms as Conditional Probability Tables (CPTs). Differentiating between strong and weak evidence can be accomplished by different shapes of evidence distributions. In this study we used the simplest form of evidence, allocating it always to one class, but in general, it is possible to allocate evidence to more than one class. Only by quantifying uncertainty about target and relationships can we develop strategies to objectively and effectively reduce it.
In addition to prediction functionality, BNs allow different analyses (e.g., evidence sensitivity, Value of Information analysis) that can be used to guide testing. Further, this framework yields a refined model as new data become available, without discarding old data. As such, it provides functionality for reassessing predictions in light of additional evidence and for reasoning with incomplete data. We inferred the potency in the LLNA given full and partial input data and showed a testing strategy guided by Value of Information (VoI) calculations. Our specific case study results showed that the integration of biological knowledge with data in the form of a BN ITS is a step forward in making efficient use of alternative data and has the potential to become a practical part of a toxicologist's toolbox.
Determining the causal structure is key to the mechanistic interpretation capability of an ITS. The structure of the developed network reflected the current knowledge about skin sensitization and included key processes, such as dermal penetration, reaction with proteins, and dendritic cell activation (Jowsey et al., 2006). BNs are interpreted as causal models, where a causal model is understood as a model that conveys causal assumptions, not necessarily one that produces validated causal conclusions (Pearl, 1988). However, the BN ITS framework also requires quantification of the relationships between nodes. In this study, a data-driven approach was used to quantify the relationships. Thus we regard the training set used as the applicability domain of the constructed ITS. Since its outcomes strongly depend on the quality and the appropriateness of the input information, the choice of tests, and the underlying training set of chemicals, expert knowledge plays an important role in assessing the quality and relevance of input information. BN
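The difference between allocating evidence to a single class and spreading it over several classes can be illustrated with a minimal sketch for one test node and the target. It assumes likelihood (virtual) evidence in the sense of Pearl, which is one common way to represent uncertain findings and is not necessarily the mechanism used in the software behind this study. The prior, the CPT values, and the number of states are hypothetical.

```python
def posterior(prior, cpt, evidence):
    """Update P(target) given evidence weights over the test states.

    prior[t]     : P(target = t)
    cpt[t][c]    : P(test state = c | target = t)
    evidence[c]  : weight given to test state c; a one-hot vector is
                   hard evidence, anything else is soft evidence.
    """
    unnorm = [
        p_t * sum(w * p_c for w, p_c in zip(evidence, cpt[t]))
        for t, p_t in enumerate(prior)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical two-class target and three-state test.
prior = [0.5, 0.5]
cpt = [
    [0.7, 0.2, 0.1],  # P(test state | target class 0)
    [0.1, 0.3, 0.6],  # P(test state | target class 1)
]

hard = posterior(prior, cpt, [1.0, 0.0, 0.0])  # result is certainly state 0
soft = posterior(prior, cpt, [0.5, 0.5, 0.0])  # result is state 0 or state 1
```

In this toy setup the soft evidence shifts the target distribution less than the hard evidence does, which is the sense in which the shape of the evidence distribution separates strong from weak evidence.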

Fig. 1: Histograms representing the discretized training data set. This is the initial state, from which all further network analyses are conducted.

Fig. 2: Latent variables: a) Bioavailability; b) Reactivity; c) Dendritic cells local networks. Each arc is tagged with an MI value between the nodes it connects.

Fig. 3: Final BN ITS for LLNA potency. To calculate the LLNA potency probability distribution, one needs to compute the Bioavailability, Reactivity, Dendritic cells and TIMES probability distributions. Every latent variable explains the cumulative influence of the manifest variables attached to it with respect to the LLNA node. Each arc is tagged with an MI value between the nodes it connects.

Fig. 4: LLNA probability distributions a) before evidence for Cys was provided; b) after evidence for Cys equal to C1 (i.e., we are 100% sure that it was C1) was provided.

Fig. 6: Adaptive test strategy for 2,5-Toluenediamine sulfate (PTD) guided by MI - step 1. In the first step, guided by Table 2, we provided information from the in silico test TIMES-M. Posteriors for LLNA and the manifest variables, and the MI values, were updated and pointed to obtaining Cys data next: (a) MIs in BN ITS after step 1; (b) LLNA and manifest posterior distributions after step 1.

Fig. 7: Adaptive test strategy for 2,5-Toluenediamine sulfate (PTD) guided by MI - step 2. Cysteine data were provided; subsequently the posteriors for LLNA and the remaining manifest variables, and the MIs, were updated again and pointed to providing Cfree next: (a) MIs in BN ITS after step 2; (b) LLNA and manifest posterior distributions after step 2.

Tab. 7: BN ITS predictions on the test set using TIMES-M and TIMES-