Open Source Software Implementation of an Integrated Testing Strategy for Skin Sensitization Potency Based on a Bayesian Network

Summary An open-source implementation of a previously published integrated testing strategy (ITS) for skin sensitization using a Bayesian network has been developed using R, a free and open-source statistical computing language. The ITS model provides probabilistic predictions of skin sensitization potency based on in silico and in vitro information as well as skin penetration characteristics from a published bioavailability model (Kasting et al., 2008). The structure of the Bayesian network was designed to be consistent with the adverse outcome pathway published by the OECD (Jaworska et al., 2011, 2013). In this paper, the previously published data set (Jaworska et al., 2013) is improved by two data corrections and a modified application of the Kasting model. The new data set implemented in the original commercial software package and the new R version produced consistent results. The data and a fully documented version of the code are publicly available (http://ntp.niehs.nih.gov/go/its).


Toxicity testing in the 21st Century is purposefully transitioning away from traditional disease-related observations in animal models and increasingly towards mechanism-based outcomes from cell-based assays and in silico models. However, it is unlikely that a single assay or in silico model will provide sufficient information on the risk or hazard posed by a chemical. Therefore, data from multiple inputs need to be integrated in a way that maximizes the utility of the available information. A Bayesian network is a graphical model that enables integration of data from multiple sources in a transparent and intuitive way. In situations where available data are incomplete or uncertain, Bayesian networks provide a coherent probabilistic framework for reasoning and guiding decisions on the classification of a substance or the need for additional testing.
The integrated testing strategy (ITS) using a Bayesian network for skin sensitization was previously developed using commercial software (Jaworska et al., 2011, 2013). The use of commercial software is convenient in some settings, but it can limit the utility and awareness of an approach by obscuring the details of the analysis. Without full access to the code and data used to generate the model, it is difficult for others to test, verify, and build on the model. Transparency was identified as one of the most important conceptual requirements of a successful ITS (Jaworska and Hoffmann, 2010). Accordingly, we developed an implementation of the Bayesian network ITS for skin sensitization using the free and open-source statistical programming language R (R v3.0.1, GNU Public License v3).
A categorical representation of a compound's potency in the murine local lymph node assay (LLNA) is used as the target endpoint (Tab. 1) in the original Bayesian network ITS models (Jaworska et al., 2011, 2013). Relative to other sensitization assays, the LLNA uses fewer animals, causes less discomfort associated with a positive response, and requires less time to complete, while also providing a quantitative measure of skin sensitization potency. The LLNA is an internationally accepted method for assessing skin sensitization hazard (OECD, 2010).
The structure of the Bayesian network was designed to be consistent with the adverse outcome pathway (AOP) for substances that initiate the skin sensitization process by covalently binding to skin proteins (Jaworska et al., 2011, 2013). There are four key events in the AOP. In order of occurrence they are: 1) covalent binding to skin proteins, 2) inflammatory responses in the keratinocyte, 3) activation of dendritic cells, and 4) T-cell proliferation (OECD, 2012). Table 2 links these events to the nodes (variables) found in the ITS structure (Jaworska et al., 2013) shown in Figure 1.
ALTEX Online first published March 31, 2014 http://dx.doi.org/10.14573/altex.1310151
In a previous paper (Jaworska et al., 2013) that described the ITS-2 model, both lipid and polar skin diffusion pathways were used for the bioavailability calculations and incorporated in an MS Excel version of the Kasting skin penetration model (Dancik et al., 2013). The bioavailability calculations for the lipid diffusion pathway are publicly accessible on the National Institute for Occupational Safety and Health website (http://www.cdc.gov/niosh/topics/skin/finiteSkinPermCalc.html), but the polar skin diffusion pathway module is under development and not yet publicly available. Upon re-evaluation of the model, the added value of the polar skin diffusion pathway was not clear. Therefore, that pathway was dropped, and the resulting minor changes to the bioavailability nodes are reflected in the current data set. Additionally, two errors in the direct peptide reactivity assay (DPRA) data for benzoic acid (training set) and imidazolidinyl urea (test set) were corrected. This revised model is referred to as ITS-2 Lipid.
Application of the Bayesian network requires three distinct computational steps, as outlined in Figure 2. First, a supervised discretization algorithm is used to find cut-points that bin the continuous assay data in the training data into intervals. The test data set cannot be used to find the discretization cut-points, since doing so would result in biased and overly optimistic prediction results. Instead, the cut-points found for the training data are used to discretize the test data. Second, mechanistically related assays are clustered to form latent (unobserved) variables. The discretized logKow, AUC120, and Cfree variables are clustered to form the Bioavailability latent variable. Similarly, the discretized results from the CD86, KEC3, and KEC1.5 assays are clustered to form the Cysteine latent variable. Forming latent variables increases the interpretability of the network while reducing its computational complexity. Finally, the relationships among variables in the discretized training data (including the latent variables) are described and quantified using a Bayesian network. The Bayesian network has a qualitative and a quantitative component. The qualitative part consists of a directed acyclic graph in which each node represents an assay and each edge (arrow) indicates that there is a relationship between the variables it connects (Fig. 1). The strength of each interaction is given by a set of conditional probability tables, one for each node, which together make up the quantitative portion of the network. The resulting model can be used to make LLNA potency category predictions for new chemicals.
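The first step, learning cut-points on the training data and reusing them unchanged on the test data, can be sketched as follows. This is a minimal Python illustration, not the R code distributed with the model: it finds a single entropy-minimizing cut-point, which is a simplification of the supervised discretization algorithms referred to above. The function names and the toy assay values and labels are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the single cut-point that minimizes the weighted class entropy.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut can separate identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (lo + hi) / 2, score
    return best_cut


def discretize(values, cut_points):
    """Map each value to the index of the interval it falls in."""
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


# Toy, made-up assay readings and potency labels (not from the published data set).
train_values = [0.2, 0.5, 0.9, 2.1, 2.8, 3.5]
train_labels = ["non", "non", "non", "strong", "strong", "strong"]
test_values = [0.7, 3.0]

cut = best_cut_point(train_values, train_labels)  # learned from training data only
train_bins = discretize(train_values, [cut])
test_bins = discretize(test_values, [cut])        # training cut-point reused on test data
print(cut, train_bins, test_bins)
```

Because `best_cut_point` never sees `test_values`, the test-set bins cannot influence where the interval boundaries fall, which is the property the paragraph above requires.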
As statistics and machine learning are the primary application domains of R, we were able to find high-quality R packages implementing each of the steps shown in Figure 2. The discretization package (Kim, 2012) contains implementations of several algorithms for supervised discretization (step 1). The Bioavailability and Cysteine latent variables were learned (step 2) using tools from the poLCA package (Linzer and Lewis, 2011). Finally, gRbase (Dethlefsen and Højsgaard, 2005) and gRain (Højsgaard, 2012) supply the functions for constructing, parameterizing, and performing inference on Bayesian networks (step 3). These packages and their use are discussed in detail in the code documentation found on the National Toxicology Program (NTP) web site (http://ntp.niehs.nih.gov/go/its).
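To make step 3 concrete, the following Python sketch shows how conditional probability tables yield a probabilistic potency prediction, including the case where some assay results are missing. It does not use or reproduce the gRain API, and the network is deliberately simplified to one potency node with two child nodes, which is not the published ITS structure. All node names and probabilities are made up for illustration.

```python
# Prior over the potency category and conditional probability tables (CPTs).
# All numbers are invented for illustration; they are not the fitted ITS values.
prior = {"non": 0.5, "weak": 0.3, "strong": 0.2}
cpts = {
    "assay_a": {  # P(assay_a bin | potency)
        "non": {"low": 0.8, "high": 0.2},
        "weak": {"low": 0.5, "high": 0.5},
        "strong": {"low": 0.1, "high": 0.9},
    },
    "assay_b": {  # P(assay_b bin | potency)
        "non": {"neg": 0.9, "pos": 0.1},
        "weak": {"neg": 0.6, "pos": 0.4},
        "strong": {"neg": 0.2, "pos": 0.8},
    },
}


def posterior(evidence):
    """Return P(potency | evidence) by direct application of Bayes' rule.

    Assays absent from `evidence` are marginalized out; in this simple
    structure their CPT rows sum to one, so they drop out of the product.
    """
    unnormalized = {}
    for category, p in prior.items():
        for node, state in evidence.items():
            p *= cpts[node][category][state]
        unnormalized[category] = p
    total = sum(unnormalized.values())
    return {category: p / total for category, p in unnormalized.items()}


partial = posterior({"assay_a": "high"})                # only one assay observed
full = posterior({"assay_a": "high", "assay_b": "pos"})  # both assays observed
print(partial)
print(full)
print(max(full, key=full.get))  # most probable potency category
```

With no evidence the function returns the prior, and each additional assay result sharpens the distribution, which is how a network of this kind can guide decisions about whether further testing is needed.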
Construction and validation of the open-source ITS network occurred in three stages. First, we demonstrated that the tools available in R for building the probability tables and performing inference on Bayesian networks could give results identical to those obtained using the commercial software. This was done by feeding the results of steps 1 and 2 obtained from the commercial software into step 3 of the R model. With the same discretization cut-points and latent node values, the LLNA category predictions for the training and test data sets were identical for the R implementation and the commercial software. In the second stage, we compared the latent variable learning algorithm used in the R model with that used by the commercial software package. The exact algorithm used by the commercial software is not known, but we compared the methods by using the data set discretized by the commercial software package as the input to the latent variable learning algorithm implemented in R (step 2). The R and commercial algorithms grouped the training set chemicals in the same way to form the latent variables. Finally, we used R versions of widely used algorithms for supervised discretization and latent class learning (steps 1 and 2), and then built the network in R as well (step 3). The overall classification accuracies of the R-based method and the commercial software package were the same, with three compounds misclassified by both methods. The predictions were not identical, however, as two compounds were classified differently by the two methods. Dihydroeugenol (2-methoxy-4-propyl-phenol) (CASRN 2785-87-7) was correctly classified as a moderate sensitizer by the R-based method and incorrectly classified as a strong sensitizer by the commercial software. Citral (CASRN 5392-40-5) was incorrectly classified as a weak sensitizer by the R-based method and correctly classified as a moderate sensitizer by the commercial software package.
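The distinction between equal accuracy and identical predictions can be illustrated with a short Python sketch. The compound names and category assignments below are hypothetical and chosen only to show the pattern; they are not the study results.

```python
def accuracy(predicted, observed):
    """Fraction of compounds whose predicted category matches the observed one."""
    hits = sum(predicted[name] == observed[name] for name in observed)
    return hits / len(observed)


def disagreements(pred_a, pred_b):
    """Sorted names of compounds that the two methods classify differently."""
    return sorted(name for name in pred_a if pred_a[name] != pred_b[name])


# Hypothetical compounds and potency categories, invented for this example.
observed = {"chem1": "moderate", "chem2": "moderate", "chem3": "weak", "chem4": "strong"}
method_a = {"chem1": "moderate", "chem2": "weak", "chem3": "weak", "chem4": "strong"}
method_b = {"chem1": "strong", "chem2": "moderate", "chem3": "weak", "chem4": "strong"}

print(accuracy(method_a, observed), accuracy(method_b, observed))
print(disagreements(method_a, method_b))
```

Each method errs on a different compound, so the accuracies match even though the prediction sets differ. Comparing per-compound predictions, as done above, is therefore a stricter check than comparing summary accuracy alone.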
The open-source R-based model with computational details is available on the NTP website (http://ntp.niehs.nih.gov/go/its), where interested users with intermediate R programming skills can access: 1) the R code and detailed documentation, 2) the current data set (with corrections) used to train and test the model, and 3) output data to verify their computations. The model is documented using the Sweave application (Leisch, 2002), which produces a "dynamic document" that integrates R code and expository text, thereby making each analysis step explicit. The Sweave document coupled with input and output data sets makes the Bayesian network ITS model independently reproducible (Fomel and Claerbout, 2009; Koenker and Zeileis, 2009; Walters, 2013; Peng, 2009).
The modular nature of the network programmed in R allows for alternative in silico and in chemico modules to serve as inputs to the Bayesian network ITS, which would then need to be retrained. Substituting in vitro modules could be more challenging, as that could potentially affect the structure of the network, but represents another opportunity provided by this transparent and flexible approach. A forum to exchange information and updates to the R

Fig. 1:
Abbreviations: AOP = adverse outcome pathway (OECD, 2012); EC150 = effective concentration that produces a 1.5-fold increase in the CD86 cell surface marker expression, the threshold for a positive response; EC3 = effective concentration that produces a stimulation index of 3, the threshold for a positive response in the LLNA; LLNA = murine local lymph node assay.