The GARD Platform for Potency Assessment of Skin Sensitizing Chemicals

539 Received January 10, 2017; Accepted March 28, 2017; Epub April 12, 2017; doi:10.14573/altex.1701101 The molecular events leading to skin sensitization and consequently to ACD can be characterized by a number of sequential key events (KE) triggered by a chemical, and have been summarized in an adverse outcome pathway (AOP) as described by the OECD (2012). The initiating event (KE1) is defined as covalent protein modification by the skin sensitizing chemical after it has gained access to deeper skin layers. The following KE2 represents the inflammatory responses upon activation of keratinocytes. KE3 corresponds to the activation of dendritic cells, which in turn leads to activation and proliferation of T cells (KE4). Upon re-exposure to the sensitizer, the development of ACD may be triggered, which is characterized by skin lesions induced by specific Th1 and CD8+ T cells. While the KE in the AOP are well described, a detailed mechanistic understanding of the underlying biology of the individual key events is still missing (OECD, 2012). The murine Local Lymph Node Assay (LLNA) (Gerberick et al., 2007), followed by the Guinea Pig Maximization Test (Magnusson and Kligman, 1969), has for many years been the


Introduction
Skin sensitization and associated diseases such as contact allergy affect a substantial portion of the general population with an estimated prevalence of 15-20% in industrialized countries (Peiser et al., 2012).Allergic contact dermatitis (ACD), a type IV hypersensitivity reaction, is common among certain occupational groups such as those regularly exposed to chemicals or involved in wet work (Behroozy and Keegel, 2014).However, also cosmetics and household products can contain numerous skin sensitizing chemicals.The European Union has imposed requirements for testing of >60,000 chemicals in the context of REACH (Hartung and Rovida, 2009), and prohibited animal testing for cosmetics and their ingredients (European Parliament, 2009).Information on both the skin sensitizing capacity and the potency of a chemical has to be provided to meet the regulatory requirements for classification and sub-categorization.Animal-free alternative assays that meet these requirements are urgently needed.
The molecular events leading to skin sensitization and consequently to ACD can be characterized by a number of sequential key events (KE) triggered by a chemical, and have been summarized in an adverse outcome pathway (AOP) as described by the OECD (OECD, 2012).The initiating event (KE1) is defined as covalent protein modification by the skin sensitizing chemical after it has gained access to deeper skin layers.The following KE2 represents the inflammatory responses upon activation of keratinocytes.KE3 corresponds to the activation of dendritic cells, which in turn leads to activation and proliferation of T cells (KE4).Upon re-exposure to the sensitizer, the development of ACD may be triggered, which as the adverse outcome is characterized by skin lesions induced by specific Th1 and CD8+ T cells.While the KE in the AOP are well described, a detailed mechanistic understanding of the underlying biology of the individual key events is still missing (OECD, 2012).
The murine Local Lymph Node Assay (LLNA) (Gerberick et al., 2007) followed by the Guinea Pig Maximization Test (Magnusson and Kligman, 1969) has for many years been the preferred assay for skin sensitization testing as the LLNA is able to provide data for both hazard identification and characterization, including skin sensitizer potency information.However, it is characterized by certain limitations such as susceptibility to vehicle effects and issues with false-positive results (Anderson et al., 2011).Several non-animal predictive methods have been developed to reduce animal experimentation used for chemical testing including computational approaches to integrate data from different test platforms for hazard identification as recently reviewed (Ezendam et al., 2016).Three test methods for skin sensitization are accepted as test guidelines at the OECD; the ARE-NRF2 luciferase method (KeratinoSens™) assay (Andreas et al., 2011;Natsch and Emter, 2008), the Direct Peptide Reactivity Assay (DPRA) (Gerberick et al., 2004) and the human Cell Line Activation Test (h-CLAT) (Ashikaga et al., 2006).In addition to hazard identification, information on skin sensitizer potency is imperative in order to allow quantitative risk assessment and to define exposure limits.Few approaches for the prediction of skin sensitizer potency have been published and were recently reviewed in (Ezendam et al., 2016), such as assays targeting KE2 (epidermal equivalent sensitizer potency assay (Teunis et al., 2014), SENS-IS (Cottrez et al., 2015)) and the U-SENS assay modelling KE3 (Piroird et al., 2015).Furthermore, in silico models, often combining information from several in vitro methods, have been described; for example QSAR (Dearden et al., 2015), artificial neural network (Tsujita-Inoue et al., 2014), probabilistic models and integrated testing strategy (ITS) approaches including a Bayesian model (Jaworska et al., 2015;Jaworska et al., 2013;Luechtefeld et al., 2015;Natsch et al., 2015).
The alternative assay Genomic Allergen Rapid Detection (GARD) for the binary classification of chemicals into skin sensitizers and non-sensitizers is based on global transcriptomic analysis of differential expression in a human myeloid cell line, induced by sensitizing chemicals in comparison to non-sensitizing controls.The resulting biomarker signature, the GARD prediction signature (GPS), consists of 200 transcripts, which are used as input into a support vector machine (SVM) model trained on a set of reference chemicals (Johansson et al., 2011).The changes in transcription can be linked to the maturation and activation of dendritic cells (KE3) during sensitization.In an in-house study based on 26 blinded chemicals, the accuracy of the assay was estimated to 89% (Johansson et al., 2014) primarily based on LLNA reference data.Previous observations indicated that the GARD assay is able to provide information relevant also for potency assessment.Firstly, signalling pathways were differentially regulated depending on the potency of a subset of chemical reactivity groups (Albrekt et al., 2014).Secondly, we observed that more potent sensitizers were generally assigned higher GARD SVM decision values compared to weaker sensitizers, indicating that there were genes within the signature contributing with potency information (unpublished observations).However, the information in the GARD prediction signature were not sufficient to completely stratify chemicals into the well-defined potency groups as described by the Classification, Labelling and Packaging (CLP) Regulation (CLP, 2016).
The CLP regulation is based on the Globally Harmonised System (CLP, 2016), and uses three categories for chemical classification; no category (no cat) for non-sensitizers, category 1B for weak and 1A for strong sensitizers.In the light of the above described observations, it was hypothesized that GARD can be developed further into a tool for the prediction of chemical skin sensitizer potency, targeting the CLP categories.As the established GARD Support Vector Machine model cannot be applied to multiclass problems, we used another approach based on random forest modelling (Breiman, 2001).Random forest is a decision-tree based method and well-suited for microarray data (Díaz-Uriarte and Alvarez de Andrés, 2006).It divides the dataset internally and repeatedly into a training and test set through random sub-sampling (bootstrapping).Samples in the test set, referred to as out-of-bag samples, comprise approximately one third of the entire dataset, and are used in order to estimate the out-of-bag (OOB) error, i.e. the classification error.
Here, we present a new approach to predict skin sensitizer potency according to CLP categories based on supervised machine learning using a random forest model.Firstly, the global gene expression data from a training set comprising 68 unique chemicals and 2 vehicle control samples were used as input into a random forest model.The random forest model was subsequently combined with an algorithm for backward variable elimination.The algorithm initially ranked the variable importance of each transcript from the microarrays, and then iteratively fitted new random forests, while removing the least important variables from the previous iteration.Using this strategy, we were able to identify a set of 52 transcripts with the smallest OOB error rate when predicting the out-of-bag samples from the training set.The predictive performance of the 52 transcripts was challenged with an independent test set containing 18 chemicals previously unseen to the model.The chemicals in this test set could be predicted with an overall accuracy of 78%.In addition to the predictive model, we also demonstrated the versatility of analyzing whole transcriptomes of cells by performing pathway analysis to further improve the mechanistic understanding of skin sensitizing potency on a cellular level, confirming the hypothesis that different chemical reactivity classes induce distinct signalling pathways.

Cells and Flow cytometry
The myeloid cell line used in this study was derived from MUTZ-3 (DSMZ, Braunschweig, Germany) and maintained as described in (Johansson et al., 2013;Johansson et al., 2011).A phenotypic control analysis of the cells prior to each experiment was carried out by flow cytometry in order to confirm the cells' immature state.The following monoclonal antibodies were used: CD1a (DakoCytomation, Glostrup, Denmark), CD34, CD86, HLA-DR (BD Biosciences, San Jose, USA), all FITCconjugated; CD14 (DakoCytomation), CD54, CD80 (BD Biosciences), all PE-conjugated.FITC-and PE-conjugated mouse IgG1 (BD Biosciences) served as isotype controls and propidium iodide as a marker for non-viable cells (BD Biosciences).Three batches of cells, representing three biological replicates, were exposed for 24 hours in independent experiments and viability and CD86 expression were assessed by flow cytometry.All FACS samples were analyzed on a FACSCanto II instrument with FACS Diva software for data acquisition.10 000 events were acquired and further analysis was performed in FCS Express V4 (De Novo Software, Los Angeles, CA).Cells for RNA extraction were lysed in TRIzol® (Life Technologies/Thermo Fisher Scientific, Waltham, USA) and stored until further use in -20°C.

Chemicals and Stimulations
All chemicals were purchased from Sigma Aldrich (Saint Louis, USA) in high purity quality or they were provided by Cosmetics Europe.All chemicals were stored according to the recommendations of the supplier.The chemical stimulations of cells were performed as described earlier (Johansson et al., 2013).In short, GARD input concentrations were defined by solubility and cytotoxicity characteristics of the chemicals.An end concentration of 500 µM was targeted for non-cytotoxic and soluble chemicals and the highest possible concentration for chemicals with limited solubility (lower than 500 µM in medium).Cytotoxic chemicals were used in a concentration targeting a relative viability of cells of 90%.Most chemicals were used from a 1000x pre-dilution in dimethyl sulfoxide (DMSO) or autoclaved MilliQ water.DMSO concentration as vehicle never exceeded 0.1%.DMSO and MilliQ samples were included as vehicle controls in this study and thus belong to the group of non-sensitizer samples.

RNA Extraction, cDNA and Array Hybridization
RNA isolation from cells lysed in TRIzol® was performed according to the manufacturer's instructions.All samples were subjected to quality control using a Bioanalyzer 2100 (Agilent, Santa Clara, United States) prior to hybridization to the microarrays.Labeled sense DNA was synthesized according to Affymetrix (Affymetrix, Cleveland, USA) protocols using the recommended kits and controls.The cDNA was hybridized to Human Gene 1.0 ST arrays (Affymetrix) and further processed and scanned as recommended by the supplier.The new microarray data were merged with historical data (Johansson et al., 2011;Johansson et al., 2014) and together subjected to quality control.The low-quality arrays excluded from downstream processing were identified from the Normalized Unscaled Standard Error (NUSE), which is an established measure to estimate overall variation for a specific array.An array was defined as poor quality if it was centered around 1.1 or had an overall higher spread of the NUSE distribution than the others, according to recommendations provided by Affymetrix.Additionally, the poor quality was also confirmed from the distribution of the arrays in a PCA plot.(Basketter et al., 2014) where available; HP, human potency; NS, non-sensitizer; S, sensitizer; 1-4=S; 5-6=NS.Otherwise according to CLP.See also Table 4.

Chemical classifications and dataset overview
A novel dataset comprising 37 well-characterized chemicals (Table 1) was selected in order to complement historical GARD data for 49 chemicals.The novel chemicals were selected based on European Chemical Agency CLP databases and literature (Basketter et al., 2014;Piroird et al., 2015) and in cooperation with the Skin Tolerance Task Force of Cosmetics Europe (CE), who kindly provided 27 chemicals.In order to build a random forest model, the new microarray data were merged with historical data (Johansson et al., 2011;Johansson et al., 2014), resulting in information from 86 unique chemicals and two vehicle controls (Table 2).Thereof, 68 samples were defined as a training set, and 18 samples, corresponding to six chemicals from each CLP category, were included in the independent test set.For four chemicals, CLP classification 1B was changed to no category/non-sensitizer according to the sources indicated in Table 2 for one of three reasons: i) for retaining consistency with previous GARD projects (benzaldehyde, xylene), ii) for being used as vehicle in non-sensitizing concentration (DMSO), and for being a well-described false-positive in the LLNA (sodium dodecyl sulfate).no cat NS train nf AT, acyl transfer agent; HP, human potency; MA, Michael Acceptor; na, not available; nf, not found; NS, non-sensitizer; S, sensitizer; SB, Schiff base formation; SN2, bi-molecular nucleophilic substitution; SNAr, nucleophilic aromatic substitution 1,3,4 NS according to (Basketter et al., 2014); 2 NS according to Sens-it-iv project (Roggen and Blaauboer, 2013) Binary classifications Binary classifications of the 37 chemicals summarized in Table 1 into sensitizers or non-sensitizers were performed with the previously established model based on SVM, using SCAN-normalized (Piccolo et al., 2012;Piccolo et al., 2013) expression data from the GPS as variable input into the learning algorithm (Johansson et al., 2011) A scaling factor was generated by calculating the ratio of the average expression value for each transcript in DMSO vehicle control samples of the training set and the average expression value for same transcript in DMSO samples in the batch where the test chemical originated.The scaling factor for each transcript was then multiplied with the expression values for the corresponding transcript in the test chemical.SVM predictions were performed as described previously (Forreryd et al., 2016;Johansson et al., 2011).In short, an SVM model based on a linear kernel was trained on reference chemical stimulations from the original training set used during identification of the GPS (Johansson et al., 2011).The trained model was subsequently applied to assign each test chemical with an SVM Decision Value (SVM DV).Resulting SVM DVs for all test chemicals were used to construct a receiver operating characteristics (ROC) curve, and the resulting area under the curve (AUC) was used as a classification measure (Lasko et al., 2005).SVM modeling and ROC curve visualizations were performed in R statistical environment, using the additional packages e1071 (Dimitriadou, 2011) and ROCR (Sing et al., 2005).Prior to evaluating final predictive performance of the model, SVM DVs for each individual replicate of the test chemicals were calibrated against the cut-off for maximal predictive performance obtained during classification of benchmark samples in Table 3, as described in (Forreryd et al., 2016).The calibrated SVM DVs were subsequently used for final classifications, and test chemicals were classified as sensitizers if the median output value of replicates > 0. Accuracy, sensitivity and specificity was estimated using cooper statistics (Cooper et al., 1979).The non-parametric Two Sample Wilcoxon test was performed in order to determine if the SVM DV distributions between CLP categories 1A, 1B and no cat differed significantly.(Basketter et al., 2014).HP, human potency; NS, non-sensitizer; S, Sensitizer

Data Handling and Statistical Analysis
In order to build a random forest model, 68 chemical and two vehicle control samples analyzed as described above with Human Gene 1.0 ST arrays (33297 transcripts, partly named genes or more commonly referred to as variables) were defined as a training set, and 18 samples, six chemicals from each CLP category, were included in the independent test set.Samples in the test set were not included in the construction of the model.The aim was to obtain a balanced training set representing all three CLP categories (Table 4) and different chemical reactivity groups as listed in Table 2 ("Toxtree protein binding class" (Patlewicz et al., 2008;Piroird et al., 2015)).Most of the chemicals in the test set (14 out of 18) originated from the latest experimental campaign (Table 1), comprising 37 chemicals previously not investigated using the GARD assay.In the training set, roughly one third of samples (23 out of 68) were from this latest dataset.The vehicle samples were part of all projects and are thus present with higher replicate numbers.The new microarray data were merged with historical data (Johansson et al., 2011;Johansson et al., 2014), and four arrays were removed due to poor quality; however, no chemical was present in less than biological duplicates, i.e. based on cells derived from at least two different batches.Array data was imported into the R statistical environment and normalized using the SCANfast algorithm (Piccolo et al., 2012;Piccolo et al., 2013).As several experimental campaigns needed to be combined, this dataset was normalized using the ComBat method (Jeffrey T. Leek, 2014;Johnson et al., 2007) in order to remove batch effects between samples.At this time, the samples in the training set were separated from the samples in the test set.To avoid overfitting, only samples in the designated training set were used during identification of the predictive biomarker signature and for fitting of parameters to the classifier, and samples in the test set were set aside to validate the performance of the identified signature and the specified classifier.The predictive biomarker signature was identified by feeding normalized and batch corrected transcript intensities from individual samples in the training set into a random forest model (Breiman, 2001) combined with a backward elimination procedure in the varSelRF package (Diaz-Uriarte, 2007) in R/Bioconductor version 3.1.2.The initial forest used for ranking of variable importance was grown to 2000 trees and all other parameters were kept at the default settings.The package iteratively fits and evaluates random forest models, at each iteration dropping 20% of the least important variables.The best performing set of variables was selected based on OOB error rates from all fitted random forests as the smallest number of transcripts within one standard error from the minimal error solution (i.e 1.s.e rule).The variable selection procedure was validated by estimating the prediction error rate by the .632+bootstrap method of the varSelRF ALTEX Online first published April 12, 2017, version 2 https://doi.org/10.14573/altex.1701101package using 100 bootstrap samples, and the importance of individual transcripts in the biomarker signature was validated by the frequency of appearance in bootstrap samples (referred to as Validation Call Frequencies (VCF)).The predictive performance of the identified biomarker signature was validated by building a new forest in the random forest package (Liaw and Wiener, 2002), based on previous parameters, using only the samples in the training set and the selected transcripts in the biomarker signature as variable input.The model was applied to assign each individual replicate sample in the test set to a CLP category, and the majority vote across the biological replicate stimulations for each chemical was accepted as the predicted category.Heatmaps and PCA plots were constructed in Qlucore Omics Explorer (Qlucore AB, Lund, Sweden) in order to visualize the RNA expression data.

Pathway Analysis
Pathway analysis was performed with the Key Pathway Advisor (KPA) tool version 16.6 (KPA, 2016), which provides a pathway analysis workflow to investigate e.g.gene expression data.It associates differentially expressed genes with both upstream and downstream processes in order to allow biological interpretation.The investigated dataset, consisting of the SCANfast-and ComBat-normalized expression values of the test set and the training set, in total 308 samples, was first variance-filtered in order to remove variables with consistently low variance until approximately a third was left (10009 variables, σ/σmax=0.1478).A multigroup comparison (ANOVA) comparing samples belonging to CLP no cat, 1B, and 1A was then applied in order to identify transcripts that were differentially regulated.The most significant 883 transcripts (false discovery rate FDR=10 -9; p=8.53x10 -11 ) were used as input into KPA (Affymetrix Exon IDs and respective p-value, overconnectivity analysis).In order to identify pathways associated with protein reactivity, the same variance-filtered dataset was filtered based on two-group comparisons (t-tests, Toxtree binding class "no binding" non-sensitizers (81 samples) versus "Michael acceptor" sensitizers (MA, 63 samples), "no binding" versus "Schiff base formation" (SB, 29 samples), "no binding" versus combined "bi-molecular nucleophilic substitution/nucleophilic aromatic substitution" (SN, 25 samples).Lists with the 500 most significantly regulated variables from each comparison, together with p-value and fold change (causal reasoning analysis), were then entered into the KPA tool for each protein reactivity group.The lowest p-value was reached when comparing MA samples to "no binding" (FDR=3.65x10-7 ), followed by SB (FDR=2.3x10 - ) and SN (FDR=2.44x10 - ).

Binary classifications of 37 chemicals
A novel dataset comprising 37 well-characterized chemicals (Table 1) was selected in order to complement historical GARD data for 51 chemicals and to represent a relevant choice of chemicals, balanced in terms of chemical reactivity class, and use in consumer products.The novel chemicals were selected based on European Chemical Agency CLP databases and literature (Basketter et al., 2014;Piroird et al., 2015) and in cooperation with the Skin Tolerance Task Force of Cosmetics Europe (CE), who kindly provided 27 chemicals.The 37 novel chemicals were predicted as sensitizers or non-sensitizers using the GPS and previously established protocols based on SVM classifications.The SVM model was applied to assign each individual replicate sample with a SVM DV.Prior to final classifications, SVM DVs from the 37 samples were first calibrated against 11 benchmark samples (Table 3) included in the same sample batch as the test chemicals.For the purpose of evaluating binary predictions, we here decided to prioritize human data (Basketter et al., 2014) when available, where classes one to four correspond to sensitizers and five and six to non-sensitizer, instead of CLP classifications.Model performance predicting the 37 chemicals is summarized by an AUC ROC of 0.88, indicating a good discriminatory ability, and as illustrated in Fig. 1A (benchmark chemicals-filled line; 37 chemicals-dotted line).The sensitivity, specificity and accuracy based on Cooper statistics were estimated to 73%, 80% and 76%, respectively.In combination with previously published data (Johansson et al., 2014) the updated predictive accuracy of GARD for binary classification of skin sensitizers is estimated to 84% based on a dataset comprising a total of 74 chemicals.When the chemicals in Table 1 were grouped according to CLP classification, and the respective SVM DV values obtained during classification of the 37 chemicals were summarized in a boxplot as presented in Fig. 1B, a potency gradient emerged, as the stronger sensitizers were assigned higher SVM DVs in comparison to the weaker sensitizers in category 1B and the non-sensitizers in no cat.According to non-parametric Two Sample Wilcoxon tests comparing SVM DV sample distributions, these groups differed significantly (no cat vs 1B: p=2.8 -5 ; 1B vs 1A p=2.8 -6 , no cat vs 1A p=3.5 -12 ).Although the differences between groups were significant, some overlap existed between individual chemicals, indicating that the information was not sufficient to completely stratify samples into well-defined potency groups.

A random forest model for the prediction of CLP categories
In order to establish a biomarker signature for prediction of CLP categories, the dataset comprising the 37 novel chemicals were merged with the historical dataset comprising 51 chemicals.In total, the dataset consisted of 86 unique chemicals (Table 2) and two vehicle controls, balanced with regards to categories 1A and 1B and non-sensitizers (no cat) as described by CLP (Table 4).For four chemicals CLP classification 1B was changed to no category/non-sensitizer according to the sources indicated in Table 2 for one of three reasons: i) for retaining consistency with previous GARD projects (benzaldehyde, xylene), ii) for being used as vehicle in non-sensitizing concentration (DMSO), and for being a well-described false-positive in the LLNA (sodium dodecyl sulfate).
A random forest model for the prediction of three CLP categories was developed based on a training set consisting of 70 unique samples, including two vehicle controls.From an input of >30000 transcripts based on whole-genome array analysis, 52 predictive variables (transcripts) (Table 5) were identified as optimal for CLP classification.The model's prediction error rate, derived from bootstrapping, was estimated to 0.225, which provides an indication of model performance.In order to visualize the dataset used to develop the model, principal component analysis (PCA) was performed.The 52 variables identified by random forest were used as input, and the PCA was built on the training set (Fig. 2A).

Tab. 5: The 52 variables identified by random forest modelling as optimal for CLP classification
The variables marked in bold text are also found in the GARD GPS.

Prediction of an independent test set
The model performance was further evaluated by predicting the CLP categories of an independent test set, which comprised 18 chemicals previously unseen to the model.The test set colored according to CLP categories was visualized in the PCA plot (Fig. 3A), without influencing the PCA components based on the 52 identified variables.Fig. 3B visualizes the regulation of the 52 transcripts in the normalized test set in form of a heatmap.When replicates were predicted separately and majority voting was used to classify the respective chemicals, 14 out of 18 chemicals were assigned into the right CLP category (Table 6, Table S1 at https://doi.org/10.14573/altex.1701101s),resulting in an overall accuracy of 78% (Table 7).The four misclassified chemicals were diethyl maleate, butyl glycidyl ether, lyral and cyanuric chloride.The only false-negative prediction was cyanuric chloride, which is classified as 1A in CLP and as no cat in our model, whereas the remaining three chemicals are classified as CLP 1B but were predicted as 1A.In a subsequent step to confirm that our selections of training and test set were unbiased and that the predictive model was not entirely dependent on the composition of the training set, we constructed 18 alternative random forest models, where the composition of chemicals in the training and test set were randomly shuffled.For each new model, we repeated the complete process of variable selection as described above.Dependent on the number of replicate samples available for each chemical stimulation, the total number of samples in each training and test set varied, but the number of chemicals in each set and their CLP distribution were kept constant.The alternative models were all significant and the average prediction error rate obtained from the bootstrapping procedure was identical to the initial model at 0.22, which supports that the presented model was not obtained due to a biased choice of training and test sets.

The CLP potency model contains information relevant for human potency prediction
Next, PCA was utilized in order to illustrate how the 52 variables (Table 5) perform using information related to human potency categories (Basketter et al., 2014).Notably, the PCA does not reflect the random forest model, it merely visualizes information contained in the 52 transcripts, and it not possible to apply the CLP potency model directly to predict human potency classes.
The 70 training and 18 test set samples were colored according to human potency (class 1-6, 6=true non-sensitizer, 1=strongest sensitizer), and samples for which no human potency category was available were removed (see Table 2).Although the model has been developed to predict CLP potency categories, the chosen 52 variables also contain information related to human potency as illustrated by PCA visualization in Fig. 4A, B.

Identity of the random forest model variables
The 52 variables identified by random forest (Table 5) represent transcripts belonging to different cell compartments and different functional roles.Five of them overlap with the GARD prediction signature (GPS).Five of the top 10 markers, which were most frequently chosen in the bootstrapping process and have the highest validation call frequencies (Table 5), are histone cluster 1 members, such as HIST1H2AB (Singh et al., 2013).Histones are highly conserved and play an important role not only for maintaining chromatin structure but also in gene regulation (Harshman et al., 2013).HIST1H2AE is both part of the GPS and the here presented potency signature.PFAS, also represented in the GPS) and PAICS are involved in purine biosynthesis (Lane and Fan, 2015;Li et al., 2007), and TMEM97 (part of GPS) is a regulator of cholesterol levels (Bartz et al., 2009), which is further described to be involved in cell cycle regulation, cell migration and invasion in a glioma cell model according to RNA interference experiments (Qiu et al., 2015).DHCR24 (another GPS transcript) represents a multifunctional enzyme localized to the endoplasmic reticulum (ER) and catalyzes the final step in cholesterol-synthesis (Waterham et al., 2001) but possesses also anti-apoptotic activity as for example shown for neuronal cells under ER stress (Lu et al., 2014).PLK1, a kinase, has been shown to be phosphorylated in response to TLR activation and results from RNA interference suggested that PLK1 signaling was involved in the TLR-induced inflammatory response (Hu et al., 2013).PLK1 was further reported to be involved in cell cycle regulation by inhibiting TNF-induced cyclin D1 expression and it could reduce TNFinduced NF-κB activation (Higashimoto et al., 2008).Many of the remaining transcripts are nuclear proteins and thus likely involved in DNA-dependent processes such as replication, transcription, splicing and cell cycle regulation.There are further several transcripts in the signature that code for proteins known for their involvement in immune responses and/or sensitization, ALTEX Online first published April 12, 2017, version 2 https://doi.org/10.14573/altex.1701101such as NQO1, which is well-described for its role in the cellular response to skin sensitizers (Ade et al., 2009) and also part of the GPS.CD53 belongs to the tetraspanin family, transmembrane proteins which have multiple functions in e.g.cell adhesion, migration and signaling and which has been shown to be elevated on FcƐRI-positive skin DCs from atopic dermatitis patients (Peng et al., 2011) similarly as on peripheral blood-derived monocytes from patients with atopic eczema (Jockers and Novak, 2006) in comparison to the respective healthy controls.CD44 is a cell surface glycoprotein, adhesion and hyaluronan receptor (Lee-Sayer et al., 2015) being expressed by numerous cell types and for example involved in inflammatory responses (Johnson and Ruffell, 2009), e.g. by mediating leukocyte migration into inflamed tissues, which has been shown in a mouse model of allergic dermatitis (Gonda et al., 2005).

Common and unique regulated pathways are induced by sensitizers differing in their protein reactivity
The 33 key pathways identified with an input of the 883 most significantly regulated transcripts after a multigroup comparison (FDR=10 -9 ) between CLP categories in KPA analysis (Fig. 5), mirror several of the functional groups of the 52 variables defined by random forest, such as gene regulation, cell cycle control and metabolism.Immune response-associated pathways such as "IL-4-induced regulators of cell growth, survival, differentiation and metabolism" and "IL-3 signaling via JAK/STAT, p38, JNK and NF-κB" were among the 50% most significantly regulated ones.
The analyses described subsequently focused on the three largest chemical reactivity groups in the present dataset; nucleophilic substitution (SN), Michael addition (MA) and Schiff base (SB) formation.Among the included chemicals, the majority of chemicals labeled as "no cat" possessed no protein binding properties; however, a few SB formation and SN chemicals were present.In category 1B, almost all protein reactivity types were represented, whereas there was a clear dominance of MA chemicals in category1A.
For each associated protein reactivity, unique pathways could be identified for sensitizing chemicals belonging to the respective protein reactivity group as presented in Fig. 6.These results combine differentially regulated transcripts from the input data with so-called key hubs, molecules that are able to regulate the expression level of the input transcripts (KPA, 2016).They cannot necessarily be identified themselves by gene expression experiments as their regulation may either be visible on other biological levels, such as activity changes (e.g. for kinases) or the changes may be very short-termed or of low magnitude.In total, 173 transcripts were common for all three reactivity groups (Fig. 7A) and six pathways were present in all three reactivity groups (Fig. 7B); "Cell cycle: Role of APC in cell cycle regulation", "Cell cycle: Role of SCF complex in cell cycle regulation", "Development: Transcription regulation of granulocyte development", "Cell cycle: Cell cycle (generic schema)", "DNA damage: ATM/ATR regulation of G1/S checkpoint", and "Mitogenic action of Estradiol / ESR1 (nuclear) in breast cancer".Again, cell cycle pathways were highly represented.Oxidative stress responses were identified as part of the key pathway results only for MA and SB chemicals (Fig 6).In the MA sensitizer chemical group, KEAP1 and NRF2 were found as key hubs as well as their target genes NQO1 and HMOX1 (Ade et al., 2009;Natsch, 2010) and AHR (Schulz et al., 2013;Kohle and Bock, 2007).For MA chemicals, the target genes NQO1, HMOX1 and CES1 (Roberts et al., 2007) were even present on the input transcripts level.On the input level, CES1 was present for SN chemicals as well, but only NQO1, NRF2 and AHR were identified as key hubs.KEAP1 was not found as key hub in SB and SN KPA analysis and AHR was the only key hub identified for all three protein reactivity groups.NF-κB subunits (RelB or p52) were predicted key hubs in all reactivity groups except in MA chemicals.
In summary, there seem to be common mechanistic responses to chemical exposures per se such as cell cycle and DNA damage-related, but the pathway analysis results also support the hypothesis that different chemical reactivity classes induce distinct signaling pathways as observed earlier by us (Albrekt et al., 2014) and in other experimental systems (Cottrez et al., 2015;Migdal et al., 2013;Natsch et al., 2015;Chipinda et al., 2011).Several pathways are linked to processes known to be relevant in skin sensitization.

Discussion
The amount of chemical per exposed skin area that induces sensitization varies significantly (Basketter et al., 2014) among chemicals; thus, skin sensitizer potency information is imperative for accurate risk assessment.Developers of alternative test method should rely on human clinical data in order to achieve high predictivity of human sensitization, however, this type of data is rather scarce and most available data is derived from the LLNA (Basketter et al., 2009).Despite the fact that animal models reflect the complexity of systemic diseases such as skin sensitization, in vitro data has so far shown to correlate well with and to perform equally well or even better than animal models, especially when combined in an ITS (Urbisch et al., 2015).Furthermore, alternative test systems may provide mechanistic insights that tests using whole animals cannot provide (Natsch et al., 2010).
Here, an approach to predict skin sensitizer potency is presented using the CLP system based on a dendritic cell (DC) model and transcriptional profiling.CLP categories are empirically determined and arbitrarily defined categories, which do not represent the diversity of different chemicals, their molecular features and mechanisms responsible for their sensitizing characteristics or the lack thereof.They are, however, what legislation currently requires in order to classify and label chemicals.We therefore investigated 37 additional chemicals previously not tested in the standard GARD assay, in order to combine these new data with historical datasets.In the binary classifications of these new chemicals according to the established GARD model, four misclassified sensitizing chemicals were close to the cut-off as defined by the benchmark samples, namely aniline, benzocaine, limonene and butyl glycidyl ether.Three of these belong to human potency class 4, which shows that the model cut-off is critical in order to translate the SVM values, often correlating well with potency, into accurate classifications.Together with historical predictions, GARD shows an overall accuracy of 84% for binary classifications.
We then used both new and historical data in order to develop a random forest model for each CLP category, which displays balanced accuracies (Brodersen et al., 2010) of 96% for no category, 79% for category 1A and 75% for category 1B (based on majority votes, for performance on replicate basis see Table 7).Butyl glycidyl ether, diethyl maleate, cyanuric chloride and lyral were misclassified; the only false-negative prediction was no category instead of 1A for cyanuric chloride.However, cyanuric chloride reacts exothermally with water forming hydrogen chloride and possibly other reaction products.Due to this hydrolyzation reaction, probably already occurring in DMSO (containing water), the amount of cyanuric chloride and reaction products present in the assay are unknown.This chemical may fall outside the applicability domain of GARD platform-based assays.However, nothing in the quality control or other pre-modelling analyses motivated a removal of these samples.Diethyl maleate and lyral are classified as 1B by CLP, but as 1A by our model, which again seems to fit more to their human potency category, category 2, as described by Basketter et al. (Basketter et al., 2014).The forth misclassified chemical butyl glycidyl ether is a human potency category 3. Obviously, predicting 1B, i.e. weak sensitizers seemed the most challenging part.Also in the LLNA, potency predictions of weak sensitizers vary more than those of strong sensitizers (Dumont et al., 2016;Hoffmann, 2015;Ezendam et al., 2016).Furthermore, 1B is a very heterogeneous group, both considering the range of LLNA EC3 concentrations (> 2% (CLP, 2016)) and human potency categories associated to chemicals summarized in category 1B.In the here presented dataset, chemicals in category 1B belong to potency classes 2 to 5. Five of the 52 transcripts forming the potency prediction signature developed to distinguish three categories, are also part of the GPS, which consists of in total 200 transcripts in order to predict two classes (Johansson et al., 2011).This degree of overlap may at first appear low, but considering a) the inherent differences in the development of the two models, i.e. the aim, in chemicals used as training set, in the applied bioinformatical methods, and b) the known "multiplicity problem" (Díaz-Uriarte and Alvarez de Andrés, 2006), it is not surprising.In short, the "multiplicity problem" describes the fact that variable selection with microarray data can result in several, equally well predictive models, in spite of sharing very few genes.
The U-SENS™ assay, formerly MUSST, uses another myeloid cell line, U937, and CD86 measurements in order to distinguish sensitizers and non-sensitizers.When the authors combined CD86 with cytotoxicity data and certain cut-off levels in order to predict CLP categories, correct predictions of 82% of Cat.1A (41/50) and 73% of Cat.1B/No Cat (85/116) were reported (Piroird et al., 2015).However, it remains unclear how the more challenging discrimination between no cat and 1B would turn out.Cottrez et al. (Cottrez et al., 2016) have recently published a study, where they report that their alternative assay SENS-IS, a 3D reconstituted epidermis based model, performs very well for the prediction of skin sensitizer potency; however, they do not target CLP categories.Judging from Fig. 4, purely based on the 52 variable input, which was defined in order to predict CLP categories, also our model seems to contain information relevant for human potency classification.Once more chemicals receive human potency classifications, it should be possible to smoothly develop a human potency model based on the GARD platform.
KPA pathway analysis identified biologically relevant events in the presented dataset as several pathways regulated have a known role in skin sensitization, e.g.cytokine signalling and oxidative stress responses (Figs.5-6).Although DCs are not the primary target for protein modification in vivo, we hypothesized that different protein reactivity classes influence the DC transcriptome differentially.Protein reactivity is one of the most important features of chemicals defining their skin sensitizing capacity and potency with certain limitations (Chipinda et al., 2011).Protein reactivity-specific patterns were detectable as revealed by the comparison of the most significantly regulated transcripts induced by the reactivity groups MA, SB, and SN.Interestingly, NF-κB subunits were predicted key hubs for all reactivity groups except for MA chemicals, which may reflect the described inhibitory effect on NF-κB signalling of this type of chemicals (Natsch et al., 2011).Although some pathways do not seem to fit into the context, such as "Mitogenic action of Estradiol / ESR1 (nuclear) in breast cancer", a closer look at regulated molecules reveals that those are certainly relevant also for other pathways.In this case, for example p21, cmyc, E2F1, SGOL2 (shugoshin 2), and CAD (carbamoyl phosphate synthetase) were involved, whereof the first ones are known cell cycle regulators/transcription factors (Buchmann et al., 1998) and play a role in chromosome segregation (SGOL2) (Xu et al., 2009).CAD, an enzyme, which is rate-limiting in the biosynthesis of pyrimidine nucleotides, on the other hand, has more recently also been implicated in cooperation with cell signaling pathways (Huang and Graves, 2003) and seems to inhibit the bacterial sensor NOD2 (nucleotide-binding oligomerization domain 2) antibacterial function in human intestinal epithelial cells (Richmond et al., 2012).These examples may illustrate that our transcriptomic data deserves further attention and more detailed analyses and this type of analysis, using different bioinformatics tools and finally, functional analyses, may contribute to elucidate mechanisms underlying biological processes and diseases.
As already discussed above, assigning correct potency classes to chemicals with weak or intermediate potency seems to be a more general problem.Benigni et al. (Benigni et al., 2016) presented data showing that even experimental in vivo systems, though in general correlating well with human data, perform less well for sensitizers of intermediate potency.They further argue that the protein modification step is the rate-limiting step of the whole sensitization process and that in vitro tests targeting other AOP events do not add much information.On the other hand, there is only a weak relationship between the rate constant of MA sensitizers as determined by kinetic profiling with a model peptide and their potency in the LLNA (Natsch et al., 2011).This was described to be linked to the anti-inflammatory effect of MA chemicals by inhibiting NF-κB signalling, which increases with reactivity.However, considering the key events in the skin sensitization AOP, the sensitization process as such can be understood as a continuum and would thus not just be characterized by isolated events.Of course, in reality the process must start with the chemical penetrating the skin.In this context it should be noted that the concept that a chemical's ability to efficiently penetrate the stratum corneum is crucial for its skin sensitization capacity and potency has recently been proven wrong (Fitzpatrick et al., 2016).Furthermore, access to lower skin layers may in reality be greatly facilitated by impaired skin barrier function, due to e.g.wet work and small wounds.Interestingly, many strong sensitizers possess irritant properties, which correlate with cytotoxicity, and cytotoxicity in turn seems to contribute to sensitizer potency (Jaworska et al., 2015), which is also reflected in our dataset (data not shown).Cytotoxicity is also connected to the protein reactivity of the chemical: chemicals, which are strongly cysteine-reactive are in general cytotoxic, and thus may interfere with vital enzyme function (Bohme et al., 2009;Jaworska et al., 2015).At the same time, irritant effects can generate danger signals such as extracellular ATP or hyaluronic acid degradation products (Esser et al., 2012;Martin et al., 2011), which may serve to activate DCs and consequently, to prime naïve T cells.Additionally, other factors such as pre-existing inflammation and co-exposures to other substances, which certainly play a role in allergic sensitization, may be hard to implement in any test system.As in vitro assays further have to face the demand of being cost-effective and easy to perform, there are obvious limitations what can be achieved in vitro, but the performances of tests so far are very encouraging (Benigni et al., 2016).Although protein reactivity may be very important, cell-based systems should be capable of recapitulating certain additional events on top of peptide reactivity.Alternative assay performance can most likely be further improved as more mechanistic details of skin sensitization are revealed, which will allow identifying both applicability domains and pitfalls more easily.
In conclusion, we have identified a predictive biomarker signature comprising 52 transcripts for classification of skin sensitizing compounds into CLP groups, as required by current legislation.When challenged with 18 independent test compounds, the assay provided accurate results for 78% of potency predictions.It further identified 11/12 sensitizers correctly, which indicates that it is rather conservative, i.e. avoids false-negative predictions.Since the presented biomarker signature is optimized for potency predictions, and not only for binary hazard classifications, we suggest a possible application for our model within an Integrated Testing Strategy for accurate potency predictions, similar as suggested by (Jaworska et al., 2015).Importantly, hazard and risk characterization can be provided based on a single experimental campaign and only requires and additional bioinformatical analysis step.In addition, the here discussed results effectively illustrate the flexibility and versatility of the GARD setup.Measuring complete transcriptomes of cells provides the opportunity to perform mode-of-action based studies to identify pathways of sensitization, but also to identify predictive biomarker signatures, which we have previously shown also for binary skin sensitization predictions and respiratory sensitization.In addition, the flexibility of the GARD platform will allow for fine-tuning of prediction models as new groups of chemicals are analyzed.Thus, we here present an initial proof of concept of a potency model targeting the CLP groups, which can be modified and improved as more samples are analyzed and more accurate human reference data emerges.
Fig. 2A is based on chemicals with biological replicates colored according to CLP categories and a clear gradient from no cat to strong sensitizers (1A) can be observed along the first principal component.The heatmap of the training set with hierarchical clustering of the variables in Fig. 2B illustrates the regulation of transcripts in relation to the respective chemical and CLP category.

Fig. 2 :
Fig. 2: Visualization of the training dataset used to develop the random forest model, using the 52 transcripts as input variables (A) PCA plot of the training set (n=70) with separate biological replicates (n=254) coloured according to CLP classifications of the chemicals.(B) Heatmap of the training set with replicates of the samples (x axis) hierarchically clustered, where the grey scale represents the relative transcript expression intensity (y axis).

Fig. 3 :
Fig. 3: Visualization of the test dataset (n=18) using the 52 transcripts (A) PCA plot of the test set with separate biological replicates (n=3) coloured according to CLP classifications.The PCA was built on the training set and the test set plotted without influencing the PCA.(B) Heatmap of the test set with samples (x axis) hierarchically clustered, where the grey scale represents relative transcript expression intensity (y axis).

Fig. 5 :
Fig. 5: Pathway analysis based on an input of the 883 most significant transcripts from a multigroup comparison of the three CLP classes using the Key Pathway Advisor tool

Fig. 6 :
Fig. 6: Pathways unique for each protein reactivity groups, as identified by pathway analysis using the Key Pathway Advisor tool

ALTEX Online first published April 12, 2017, version 2 https://doi.org/10.14573/altex.1701101
. Prior to model construction, potential batch effects between training set and test chemicals were eliminated by scaling array expression values for test chemicals 6against the training set.

: Statistics by class for separate replicates in the test set predictions
Three biological replicates per chemical; for replicate predictions see Tab.S1 at https://doi.org/10.14573/altex.1701101s.GARD misclassifications are highlighted in italics.Six chemicals per category; three biological replicates each; n (test set) = 18.