Evaluation of the Global Performance of Eight In Silico Skin Sensitization Models Using Human Data

Allergic contact dermatitis, the clinical manifestation of skin sensitization, is a leading occupational hazard. Several testing approaches exist to assess skin sensitization, but in silico models are perhaps the most advantageous due to their speed and low cost. Many in silico skin sensitization models exist, though many have been tested only against results from animal studies (e.g., the local lymph node assay, LLNA); this creates uncertainty in human skin sensitization assessments in both a screening and regulatory context. This project’s aim was to evaluate the accuracy of eight in silico skin sensitization models against two human data sets: one highly curated (Basketter et al., 2014) and one screening level (HSDB). The binary skin sensitization status of each chemical in each of the two data sets was compared to the predictions from eight in silico skin sensitization tools (Toxtree, PredSkin, OECD’s QSAR Toolbox, UL’s REACHAcross™, Danish QSAR Database, CAESAR, TIMES-SS, and Lhasa Limited’s Derek Nexus). Models were assessed for coverage, accuracy, sensitivity, and specificity, as well as optimization features (e.g., probability of accuracy, applicability domain), if available. While there was a wide range of sensitivity and specificity, the models generally performed comparably to the LLNA in predicting human skin sensitization status (i.e., approximately 70–80% accuracy). Additionally, the models did not mispredict the same compounds, suggesting there might be an advantage in combining models. In silico skin sensitization models offer accurate and useful insights in a screening context; however, further improvements are necessary so these models may be considered fully reliable for regulatory applications.

into question (NRC, 2007; Hartung, 2008, 2009; Leist and Hartung, 2013; Pound and Bracken, 2014; Bailey et al., 2014, 2015; Akhtar, 2015). This is a burden to regulators and manufacturers alike, as the combination of increased restrictions on animal testing and the high cost and lengthy turnaround time of traditional animal methods makes them increasingly unsuitable for chemical hazard assessment.
Because of increased concerns about animal welfare on the part of both consumers and regulatory agencies, there has been increased emphasis on developing non-animal test methods to predict skin sensitization as an alternative to the LLNA (Basketter et al., 2012). Several types of non-animal alternative methods have been developed to address these issues, including a) tests for direct chemical and biochemical reactivity (in chemico), b) tests for effects in cultured cells, tissues, and organs (in vitro), and c) computer-modeled prediction of hazards (in silico). The direct peptide reactivity assay (DPRA) is an in chemico method used to predict the reactivity of the target chemical with skin proteins, which is the molecular initiating event (MIE) in the skin sensitization adverse outcome pathway (AOP) (OECD, 2019). Validated in vitro models include the ARE-Nrf2 luciferase test method (KeratinoSens™ and LuSens) (OECD, 2018a), which measures the activation of keratinocytes, the second key event in the AOP. The human cell line activation test (h-CLAT), the U937 cell line activation test (U-SENS™), and the interleukin-8 reporter gene assay (IL-8) (OECD, 2018b) are also validated in vitro methods; they evaluate activation of dendritic cells, the third key event in the AOP for skin sensitization. Available in silico models are numerous (Wilm et al., 2018; Alves et al., 2018) and include those that assess structural alerts (Enoch et al., 2011), use a read-across approach (Hartung, 2016; Luechtefeld and Hartung, 2017), or employ quantitative structure activity relationships ((Q)SARs) (Chaudhry et al., 2010; Alves et al., 2016a). Finally, combinations of these tools can be used to capture the various steps in the skin sensitization AOP in what is referred to as a defined approach (DA) (OECD, 2016). Most DAs are hazard-based schemes that may include sequential testing strategies (STS) or integrated testing strategies (ITS).
As the names imply, the STS is a stepwise approach in which the user can make a hazard prediction at any point in the strategy, while an ITS employs all steps before a hazard prediction is made.
In chemico and in vitro models provide valuable information; however, these methods are restricted to laboratories, which reduces their convenience. DAs are a relatively new concept, and, while they are gaining momentum due to improved accuracy over traditional approaches to evaluate skin sensitization, only a limited number of well-described ITS are publicly available (Johansson and Lindstedt, 2014; Luechtefeld et al., 2015; Jaworska, 2016; Macmillan and Chilton, 2019). In silico models are generally less expensive and faster (Raunio, 2011) and can be implemented almost ubiquitously, unlike their laboratory-based counterparts. This makes them attractive tools for hazard evaluation, particularly in a screening context, as chemical and product manufacturers seek methods to quickly and cost-effectively identify skin-sensitizing chemicals to protect both workers and consumers (Cronin, 2010; Cronin and Madden, 2010; Maertens et al., 2014; Maertens and Hartung, 2018).
Due to ethical considerations, publicly available skin sensitization data in humans are generally limited (Basketter, 2009); therefore, the development of in silico skin sensitization models has relied heavily on data from animal studies, as well as in vitro studies. Numerous reviews have evaluated the performance of in silico skin sensitization models against animal data (Chaudhry et al., 2010; Bauch et al., 2012; Teubner et al., 2013; Kostal and Vouchkova-Kostal, 2016; Verheyen et al., 2017; Fitzpatrick et al., 2018); however, less consideration has been given to how well in silico models predict skin sensitization when judged against human data (Bauch et al., 2012; Verheyen et al., 2017; Luechtefeld et al., 2018a). This analysis aims to: (i) assess the availability and quality of human skin sensitization data for hazard screening assessments, and (ii) identify strengths and weaknesses of current in silico skin sensitization models so that they may be properly applied and improved for future use in human hazard assessment.

Data sets
A set of 131 chemicals and their corresponding human skin sensitization statuses compiled by Basketter et al. (2014) were used in this assessment (see Tab. S1¹). The skin sensitization statuses assigned in Basketter et al. (2014) were largely based on human predictive and diagnostic patch test data, and the potencies ranged from Human Category 1 to Human Category 6, with chemicals in Category 1 being the most potent sensitizers and chemicals in Category 5 being weak sensitizers; chemicals in Category 6 were defined as non-sensitizers. For this analysis, the categories were converted to binary outcomes ("sensitizer" or "non-sensitizer"): Categories 1 through 5 were converted to sensitizers, whereas Category 6 chemicals were converted to non-sensitizers. See Basketter et al. (2014) for additional details on the definitions of the initial assignment of Categories 1 through 6.
The Chemical Abstracts Service Registry Numbers (CASRNs) were taken from Basketter et al. (2014), and the Simplified Molecular Input Line Entry System (SMILES) strings were obtained from ChemIDplus Advanced². No SMILES string was available in ChemIDplus Advanced for 8 of the 131 chemicals in the Basketter et al. (2014) data set; however, in order to control for SMILES string variation, only ChemIDplus Advanced was used as a source for SMILES strings. Nevertheless, these 8 chemicals were still included in the data set because some of the models can utilize CASRN as an input.
Additionally, a manually curated human skin sensitization data set of 375 chemicals was prepared by querying the Hazardous Substances Data Bank (HSDB) for chemicals associated with the search term "skin sensitization"³ (see Tab. S2¹). HSDB is a peer-reviewed, toxicological database supplemented with information such as human and environmental exposures, industrial hygiene, and regulatory requirements. HSDB data are reported as study summaries, warning statements, or other abbreviated hazard records. It was selected as a literature source due to its large, centralized collection of human data and relevance to occupational exposures.
"Skin sensitization" was used as the search term in the HSDB to identify human skin sensitization data. This search produced 923 discrete chemical records. Each occurrence of "skin sensitization" or "sensitization" in each record was reviewed; this led to the review of 1,264 summaries (some chemical records had multiple occurrences of the terms "sensitization" or "skin sensitization"). In HSDB, summaries of various information sources, rather than complete dossiers, data sheets, etc., are provided. Information sources summarized include case reports, drug warnings, epidemiology studies, human exposure studies, safety data sheets, etc. If it was not clear from the summary whether a skin sensitization study was exclusively based on human data, it was not included in the final data set. This review process eliminated 411 chemicals, as these contained no human data, leaving 512 chemicals with human data. In order to have a discrete identifier, chemicals without a CASRN were not included in the final data set, which eliminated 13 of the 512 chemicals, leaving 499 chemicals with human data and a CASRN.
Finally, from these 499 chemicals, 124 chemicals were eliminated due to incomplete or unclear summaries. Incomplete or unclear summaries were those that were considered to have insufficient information to characterize the skin sensitization status of the chemical. For example, if route of exposure was not clearly given as dermal in the summary, the chemical was not included in the final data set. Additionally, if a chemical appeared to be administered as part of a mixture, it was not included in the final data set. This led to a final data set of 375 chemicals with human skin sensitization data on which the sensitization status of a chemical could be based. Chemicals were considered to be skin sensitizers if there was evidence of sensitization; consequently, the chemicals in this data set likely capture a range of sensitization potency. Chemicals were considered to be non-sensitizers if the authors concluded no sensitization reaction was observed. The CASRNs were taken from the HSDB entry for each chemical, and the SMILES strings were again obtained solely from ChemIDplus Advanced². Of the 375 chemicals in the HSDB data set, SMILES strings were unavailable for 14 chemicals. Again, these 14 chemicals remained in the data set, as some models are capable of using the CASRN as an input.
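The attrition counts reported for the HSDB curation can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative and not part of the original analysis.

```python
# Record counts reported in the text for the HSDB curation steps.
initial_records = 923
exclusions = [
    ("no human data", 411),
    ("no CASRN", 13),
    ("incomplete or unclear summary", 124),
]

remaining = initial_records
for reason, n_removed in exclusions:
    remaining -= n_removed
    print(f"after removing {n_removed} ({reason}): {remaining} chemicals")

final_count = remaining

# Chemicals in the final set that lack a SMILES string can only be run
# through models that accept a CASRN as input.
missing_smiles = 14
with_smiles = final_count - missing_smiles
print(f"final data set: {final_count}; with SMILES: {with_smiles}")
```

The intermediate totals reproduce the 512, 499, and 375 chemicals given in the text.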
The Basketter et al. (2014) chemical data set represents a data set that was culled from the literature and critically reviewed by experts; in contradistinction, the HSDB data set represents the quality and availability of data typically used when carrying out a chemical hazard screening assessment and is not as precise in distinguishing skin sensitization potency.
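Because both data sets were compared with the models as binary labels, the six Basketter et al. (2014) potency categories had to be collapsed. A minimal sketch of that conversion, assuming the categories are stored as integers 1 to 6; the function name is illustrative rather than part of the original analysis.

```python
def to_binary_status(human_category: int) -> str:
    """Collapse a Basketter et al. (2014) human potency category into a binary call.

    Categories 1-5 (most potent to weakest sensitizers) map to "sensitizer";
    Category 6 maps to "non-sensitizer".
    """
    if human_category not in range(1, 7):
        raise ValueError(f"Unexpected human category: {human_category!r}")
    return "non-sensitizer" if human_category == 6 else "sensitizer"


# One call per category, for illustration only.
for category in range(1, 7):
    print(category, to_binary_status(category))
```

The separate exploration of Category 5 chemicals would correspond to moving the cut-off so that Category 5 is also returned as "non-sensitizer".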

CLP comparison
To assess the concordance of our skin sensitization statuses assigned to the Basketter et al. (2014) and HSDB data sets with the more standardized approach adopted in the Classification, Labelling, and Packaging (CLP) Regulation ((EC) No 1272/2008), we compared our data with the harmonized skin sensitization classifications as assigned according to the CLP (ECHA, 2018a). The CLP Regulation requires chemical manufacturers to classify the hazard of their product, communicate this hazard through appropriate labeling, and package the product in a manner that addresses the hazards. To date, the European Chemicals Agency (ECHA) has assigned hazard classifications for several thousand chemicals (i.e., harmonized classifications), and the remaining chemicals available in commerce must be self-classified (ECHA, 2018b).

Model performance
The skin sensitization status of all chemicals from both data sets was then predicted via 8 in silico models using either the CASRN or the SMILES string for each chemical, depending on the required input of each model. An overview of each model is summarized in Table 1.
The chemicals from the two data sets were evaluated in batch mode for all models, with the exception of REACHAcross™ (Luechtefeld et al., 2018b): The version used in this analysis does not offer a batch mode, so the chemicals were processed individually using CASRNs. It should be noted that some of the chemicals have structures with multi-position substituents; consequently, their SMILES strings are unspecified. Some of the models that use SMILES strings as the input parameter (i.e., QSAR Toolbox, CAESAR, TIMES-SS, and Derek Nexus (Derek)) have a strict SMILES string interpretation, and, as a result, some of the ambiguous SMILES strings in the data set were not predicted.
The global performance of all models was evaluated by assessing the following parameters: coverage, sensitivity, specificity, false predictions, accuracy, and balanced accuracy. For false predictions, the chemical class, partition coefficient, and molecular weight were evaluated to identify patterns of false predictions (if any) in the models. The chemical class for each chemical was assigned using the U.S. EPA New Chemical Category from QSAR Toolbox. The octanol:water partition coefficients were identified using the KOWWIN model in U.S. EPA's EPISuite, and the molecular weights were obtained from ChemIDplus Advanced².
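The performance parameters listed above can be sketched from a confusion matrix as follows. The formulas are the standard definitions and are assumed, not confirmed, to match the calculations used in this analysis; the class name, function name, and example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int       # sensitizers predicted as sensitizers
    fn: int       # sensitizers predicted as non-sensitizers
    tn: int       # non-sensitizers predicted as non-sensitizers
    fp: int       # non-sensitizers predicted as sensitizers
    n_input: int  # chemicals submitted to the model


def performance(c: Counts) -> dict:
    """Return coverage, sensitivity, specificity, accuracy, and balanced accuracy."""
    predicted = c.tp + c.fn + c.tn + c.fp
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        # Coverage: returned predictions over chemicals submitted.
        "coverage": predicted / c.n_input,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Accuracy is computed only over chemicals with a returned prediction.
        "accuracy": (c.tp + c.tn) / predicted,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts, not results from this study.
example = performance(Counts(tp=60, fn=20, tn=15, fp=5, n_input=125))
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```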
Some of the models offer additional information about the prediction. PredSkin and REACHAcross™ offer an associated probability of accuracy with the returned prediction, and the distribution of these probabilities was assessed to evaluate the confidence in the predictions provided by the models. Additionally, TIMES-SS is highly dependent on metabolism, so a separate assessment of the effect of metabolism on prediction accuracy was included in this evaluation.
Finally, there is some debate as to whether the Human Category 5 chemicals (i.e., rare sensitizers) from the Basketter et al. (2014) data set should be considered sensitizers. Skin sensitization has been reported following exposure to these chemicals; however, the data available are not sufficient to classify these chemicals as skin sensitizers according to GHS (see Basketter et al., 2014 for additional details). Therefore, a separate exploration was performed in this analysis in which these chemicals were considered non-sensitizers, and the effect on model accuracy was evaluated.

Tab. 1: Overview of the models
Summary of the administrative information, availability, and methodology of each in silico model as well as whether it predicts binary or potency outcomes.

Results*
Predicts whether chemical will be a skin sensitizer (Binary results for human prediction)
Identifies structural alerts within target chemical
Predicts whether chemical will be a skin sensitizer (Binary results*)
Predicts whether chemical will be a skin sensitizer (Binary results for human prediction)
Predicts whether chemical will be a skin sensitizer (Binary results)
Predicts whether chemical will be a skin sensitizer (Binary results*)
Predicts whether chemical will be a skin sensitizer (Binary results*)
Predicts whether chemical will be a skin sensitizer based on structural alerts (Binary results*)

Structural alerts
Read-across/skin metabolism simulation
Battery algorithm based on three individual QSAR models
QSAR
QSAR/read-across
QSAR/skin metabolism simulation
Expert knowledge-based system

IRCCS, Istituto di Ricerche Farmacologiche Mario Negri; LMC, Laboratory of Mathematical Chemistry; NFI, National Food Institute, Technical University of Denmark; UL, Underwriters Laboratories.
*Binary results were used in this analysis. TIMES-SS potency results were converted to binary results. QSAR Toolbox and Derek Nexus offer potency scales in addition to binary outputs; however, binary results were used for this analysis. REACHAcross™ offers predictions in terms of GHS Categories; however, binary results were used for this analysis.
**Chemicals were evaluated against the skin sensitization endpoint using mammal as the selected species. Perceive tautomers, perceive mixtures, and match alerts without rules options were unselected.

Data sets
The Basketter et al. (2014) data set consists of 131 chemicals, of which approximately 80% are sensitizers representing a wide range of potencies and 20% are non-sensitizers (see Tab. 2). In the HSDB data set (n = 375 chemicals), approximately 60% of the chemicals are sensitizers and 40% are non-sensitizers.
Chemical summaries in the final HSDB data set were drawn from roughly 16 source types; however, 80% of the skin sensitization summaries were extracted from only three source types: Human Exposure Studies, Human Toxicity Excerpts, and Signs and Symptoms. The distribution of all data sources is provided in Figure 1.
Hazard data availability for the HSDB data set is summarized in Figure 2. For over 60% of the chemicals in the final HSDB data set, only one summary was available. Overall, there were few chemicals with multiple summaries: only 18 chemicals had 5 summaries or more available.

CLP comparison
Fewer than half of the chemicals in each of the data sets had harmonized skin sensitization classifications in the CLP, as illustrated in Table 3. When the CLP harmonized classifications were compared to classifications in Basketter et al. (2014), the concordance was 78%, which was higher than the concordance of the CLP classifications compared to the HSDB data set at 64%. The Basketter et al. (2014) and HSDB data sets had similar percentages of concordant positives (100% and 92%, respectively) and concordant negatives (45% and 41%, respectively). The HSDB data set had a discordant positive rate of 59%, which was also very similar to Basketter et al. (2014) at 55%. No discordant negatives were assigned in the Basketter et al. (2014) data set, and a low number of discordant negatives were assigned in the HSDB data set (8%).

Fig. 1: Distribution of the HSDB entries by data source for all chemicals in the final HSDB data set
The HSDB entries summarize skin sensitization studies in humans and categorize them according to source type. Those source types that accounted for ≤ 1% of all source types were combined in "Other" for clarity.

while the HSDB data set distribution is simply binary. If a SMILES string was unavailable, this indicates that it was not available in ChemIDplus Advanced. A CASRN issue indicates that a chemical may be represented by more than one CASRN; in this case, a representative CASRN was used.

Tab. 3: Concordance of the skin sensitization status for both the Basketter et al. (2014) and HSDB data sets compared to the harmonized CLP skin sensitization classification statuses
To assess the concordance of the skin sensitization statuses in each data set with the more standardized approach adopted in the CLP, we compared each data set with the harmonized skin sensitization classifications as assigned according to the CLP (ECHA, 2018a).

Data set comparison
Of the 131 and 375 chemicals assessed in Basketter et al. (2014) and HSDB, respectively, only 8% of the chemicals (n = 41) occurred in both data sets. The concordance for all chemicals with overlap was 78%. The binary concordance between the Basketter et al. (2014) and HSDB data sets is shown by potency category, as assigned in Basketter et al. (2014), in Table 4. The chemicals classified by Basketter et al. (2014) as rare sensitizers (Human Category 5) had low concordance (40%).

Model performance
Model performance was assessed based on model coverage, accuracy, balanced accuracy, sensitivity, and specificity. Base model settings were defined as those used when simply entering the input parameter (i.e., CASRN or SMILES) into the model; the prediction returned was considered the base result. To better illustrate the capabilities of each model, optimal model settings were also used to assess all performance measures, and the prediction produced by incorporating these optimal settings was the high confidence result. The optimization features included applicability domain, probability of accuracy, and metabolism, if available. Only high confidence results are presented for the model performance metrics except for coverage; both base and high confidence results are presented for coverage. Optimization settings (i.e., the settings applied to obtain a high confidence result) for each model are shown in Table 5. For PredSkin and REACHAcross™, a high confidence result was one that fell within the applicability domain and had a probability of accuracy greater than 70%. For QSAR Toolbox, the prediction had to be within the applicability domain; both the base and optimized settings for QSAR Toolbox have the same coverage because the automated batch mode applies the same settings universally. For the Danish QSAR Database, CAESAR, and TIMES-SS, a high confidence result was defined as a prediction falling within the applicability domain. A high confidence result for Derek was defined as every prediction with the exception of "non-sensitizer with misclassified and/or unclassified features." Toxtree is a structural alert model and, therefore, does not have any optimization features, so the model output was considered to be the high confidence result.
Of note, the TIMES-SS model provides predictions in potency format; however, for this assessment, the results were converted to binary format. Additionally, Derek provides likelihood levels for alerting chemicals, which were converted to binary format (certain, probable, plausible, equivocal = sensitizer; doubted, improbable, impossible = non-sensitizer) for this assessment. REACHAcross™ provides results in both binary and GHS Category format; for this assessment, the binary results were used. QSAR Toolbox has the potential to predict skin potency values (i.e., EC3 values) in the automated workflow; however, the binary results from the automated workflow were used for this assessment. The use of binary results in the context of this assessment allowed for easier comparison of global performance across all models.

Tab. 5: Summary of optimized settings by model
For each model, optimized settings were applied to obtain a high confidence result. Additionally, most models provided predictions in binary form; however, some models provided predictions in potency format, and those predictions were converted to binary format.

Model | Optimized settings | Prediction format
PredSkin | Within applicability domain; probability of accuracy > 70% | Binary
QSAR Toolbox | Within applicability domain; metabolism incorporated | Binary (in automated workflow)
REACHAcross™ | Within applicability domain; probability of accuracy > 70% | Binary and GHS Category format; binary used for this analysis
TIMES-SS | Within applicability domain; metabolism incorporated | Potency; converted to binary for this analysis
Derek | All predictions with the exception of non-sensitizer with misclassified and/or unclassified features | Likelihood level; converted to binary for this analysis
*Toxtree is a structural alert model and, therefore, does not have any optimization features. Consequently, the model output was considered to be the high confidence result without any further optimization.

Model coverage for base and optimized settings using the HSDB data set is illustrated in Figure 3. Using base settings, the model coverages were similar for several of the models: PredSkin, REACHAcross™, CAESAR, TIMES-SS, and Derek all had base coverages above 85%. The Danish QSAR Database had a base model coverage of 65%, while the QSAR Toolbox had a base coverage of 53%. However, it should be noted that the QSAR Toolbox coverage was the same under base and optimized conditions because the same settings are always applied for the automated batch method used in this model. For the optimized settings, Derek had the greatest model coverage at 90% for the HSDB data set. The other models produced coverages ranging from 30% to approximately 50%. The model coverages for the Basketter et al. (2014) data set (both base and optimal settings) can be found in Figure S1¹, and the number of chemicals entered into each model and the number of predictions returned for both data sets can be found in Table S3¹.
Accuracy and balanced accuracy both had wide ranges across the models; however, the majority of the models tended to perform comparably, as demonstrated by the similarity in accuracy and balanced accuracy for many of the models. As shown in Fig
For both data sets, there were more sensitizers than non-sensitizers, and, therefore, the data sets are unbalanced; as a result, balanced accuracy (Fig. 5) is a better metric for model performance. The balanced accuracies of the models spanned 33 percentage points for the Basketter et al. (2014) data set and 26 percentage points for the HSDB data set. Overall, TIMES-SS had the highest balanced accuracy for the Basketter et al. (2014) data set at 88%, but, as with model accuracy, other models had balanced accuracies that were comparable, including Derek (86%), REACHAcross™ (83%), and CAESAR (83%). PredSkin had the lowest balanced accuracy for the Basketter et al. (2014) data set at 55%.
For the HSDB data set, REACHAcross™ and the Danish QSAR Database both had the highest balanced accuracy at 78%, but TIMES-SS and Derek had similar balanced accuracies at 72% and 70%, respectively. PredSkin, again, had the lowest balanced accuracy at 52%.
With the exception of Danish QSAR Database and Toxtree, which had fairly low sensitivity, the majority of models exhibited similar sensitivities (75%-90%); PredSkin had the highest sensitivity at 98% for the Basketter et al. (2014) data set (see Fig. 6A). Model specificity had a much wider range (87 percentage points) for the Basketter et al. (2014) data set. TIMES-SS had perfect specificity at 100%, while PredSkin had the lowest specificity at 13%. QSAR Toolbox also had relatively low specificity at 45%, while the remaining models had specificities ranging from approximately 70% to almost 90%. Figure 6A presents the sensitivity and specificity of each model as well as model coverage (indicated by the number inside each symbol). The specific percentages for sensitivity and specificity can be found in Table S4¹, and the false prediction percentages for each model can be found in Table S5¹. The range of model sensitivity for the HSDB data set was 45 percentage points. PredSkin, again, achieved the highest sensitivity at 100% while Toxtree had the lowest sensitivity at 55%. The sensitivities of the remaining models were moderate to high, ranging from approximately 70% to nearly 90%.
As with the Basketter et al. (2014) data set, model specificity for the HSDB data set had a much wider range than sensitivity, at 80 percentage points. The Danish QSAR Database had the highest specificity at 84% while PredSkin, again, had the lowest specificity at 4%. CAESAR and, again, QSAR Toolbox had relatively low specificity at 25% and 47%, respectively, while the other models had specificities ranging from 64% to 68%. The sensitivity and specificity of each model for the HSDB data set are depicted in Figure 6B, along with the model coverages (indicated by the values within each symbol).
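The reason balanced accuracy is preferred for these unbalanced data sets can be illustrated with the PredSkin figures reported for the Basketter et al. (2014) data set (98% sensitivity, 13% specificity). The sketch below is illustrative only: the 80% sensitizer fraction is the proportion for the full data set and is an assumption here, since the subset of chemicals with high confidence predictions may have a different mix, and the function names are hypothetical.

```python
def accuracy_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy implied by sensitivity, specificity, and the sensitizer fraction."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity, independent of class proportions."""
    return (sensitivity + specificity) / 2


# PredSkin on the Basketter et al. (2014) data set, as reported in the text.
sens, spec = 0.98, 0.13
# Assumption: ~80% sensitizers (full-data-set proportion).
prevalence = 0.80

ba = balanced_accuracy(sens, spec)
acc = accuracy_from_rates(sens, spec, prevalence)
print(f"balanced accuracy: {ba:.3f}")
print(f"implied accuracy:  {acc:.3f}")
```

Under these assumptions the implied accuracy is roughly 0.81 even though the model misclassifies most non-sensitizers, whereas the balanced accuracy is about 0.555, in line with the 55% reported above.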

Fig. 3: Base and high confidence model coverage for HSDB data set
Base results (calculated using base settings) and high confidence results (calculated using optimized settings) for model coverage calculated by evaluating the number of returned predictions over the total number of chemicals with model input parameters. Base results were calculated by assessing the number of predictions produced without applying optimization features. High confidence results were obtained by implementing optimization features. Toxtree does not have any optimization features; consequently, the model coverage, which is 100% under both conditions due to its structural alert nature, is not shown here.
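A minimal sketch of how base and high confidence coverage could be computed from a list of model outputs. It covers only the applicability domain and probability-of-accuracy criteria (the Derek and metabolism-related criteria are not modeled); the record layout, function names, and example predictions are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    call: Optional[str]           # "sensitizer", "non-sensitizer", or None if nothing was returned
    in_domain: bool               # within the model's applicability domain
    probability: Optional[float]  # probability of accuracy, if the model reports one


def is_high_confidence(p: Prediction, min_probability: float = 0.70) -> bool:
    """Apply the applicability domain and probability (> 70%) criteria."""
    if p.call is None or not p.in_domain:
        return False
    # Models without a probability output are judged on applicability domain alone.
    return p.probability is None or p.probability > min_probability


def coverage(predictions: List[Prediction], high_confidence: bool = False) -> float:
    """Returned (or high confidence) predictions over chemicals submitted."""
    if high_confidence:
        returned = sum(is_high_confidence(p) for p in predictions)
    else:
        returned = sum(p.call is not None for p in predictions)
    return returned / len(predictions)


# Hypothetical outputs for five chemicals.
predictions = [
    Prediction("sensitizer", True, 0.85),
    Prediction("sensitizer", True, 0.55),
    Prediction("non-sensitizer", False, 0.90),
    Prediction("non-sensitizer", True, None),
    Prediction(None, False, None),
]
base_coverage = coverage(predictions)
high_confidence_coverage = coverage(predictions, high_confidence=True)
print(base_coverage, high_confidence_coverage)
```

Because the high confidence filter can only remove predictions, optimized coverage is never higher than base coverage, which is the pattern shown in the figure.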

Fig. 4: Model accuracy for high confidence results
For the Basketter et al. (2014) data set, model accuracy was calculated by assessing concordance between the sensitization status assigned in Basketter et al. (2014) and the skin sensitization prediction (using optimized settings) from each model. For the HSDB data set, model accuracy was calculated by assessing concordance between the HSDB skin sensitization status and the skin sensitization prediction (using optimized settings) from each model. Only chemicals with a returned prediction were included in the accuracy assessments.

Fig. 7: Distribution by U.S. EPA New Chemicals Category for the Basketter et al. (2014) data set
Chemical categories for the Basketter et al. (2014) data set (n = 131) were assigned using the U.S. EPA New Chemicals Categorization tool in the QSAR Toolbox. If more than one functional group is identified in a category, it is because the chemical has both functional groups. Chemical categories that account for < 3% of the total chemical category distribution are not shown in the figure for clarity.

Fig. 8: Distribution by U.S. EPA New Chemicals Category for the HSDB data set
Chemical categories for the HSDB data set (n = 375) were assigned using the U.S. EPA New Chemicals Categorization tool in the QSAR Toolbox. If more than one functional group is identified in a category, it is because the chemical has both functional groups. Chemical categories that account for < 3% of the total chemical category distribution are not shown in the figure for clarity.

To determine if there were any patterns in incorrect predictions, mispredictions were assessed by chemical category, partition coefficient, and molecular weight. The chemical category distribution for the Basketter et al. (2014) data set (assigned using the U.S. EPA New Chemical Category database in QSAR Toolbox) is shown in Figure 7. A total of 28 chemical categories were identified for the Basketter et al. (2014) data set. The majority of chemicals were classified as "Not Categorized", meaning that they did not meet the chemical definitions of any chemical category as defined under the U.S. EPA New Chemicals Program. Chemical categories that occurred with
The HSDB data set incorporated a larger variety of chemical categories (n = 45). The distribution of these categories is shown in Figure 8. Again, the majority of chemicals were classified as "Not Categorized". Neutral organics, esters, aldehydes, and aliphatic amines were among the chemical categories that occurred with the highest frequencies in the HSDB data set.
Mispredictions were assessed by chemical category to determine if they were falsely predicted at an increased frequency due to issues with the models assessing certain chemical categories or simply by chance. In general, the frequency of false predictions was not increased when compared to the expected distribution for either data set, with the exception of the ester/phenol chemical class in the Basketter et al. (2014) data set (see Fig. 9), which had an increased incidence of false predictions in many of the models, including QSAR Toolbox (6-fold increase), CAESAR (4-fold increase), TIMES-SS (3-fold increase), REACHAcross™ (2-fold increase), and Toxtree (2-fold increase). For the HSDB data set, the frequency of false predictions was not increased for any chemical category.
It has often been assumed that above a certain molecular weight or partition coefficient, a chemical is not a skin sensitizer (Smith Pease et al., 2003). Our data set contained few chemicals with a molecular weight above 500 g/mol or a log Kow ≥ 5 (Fig. S2-S5¹), so we did not perform a sub-analysis of those features. However, more robust analysis of larger data sets has demonstrated that the molecular weight cut-off reduces the probability that a chemical is a skin sensitizer but does not preclude it (Gerberick et al., 2005; Fitzpatrick et al., 2016; Luechtefeld et al., 2016).
Two of the models include a probability of the accuracy of the returned prediction: PredSkin and REACHAcross™. These models each provide predictions with a probability of accuracy down to 50%. The distributions of probabilities for both models for the Basketter et al. (2014) and the HSDB data sets are shown in Figure 10A and 10B, respectively. REACHAcross™ had some predictions that had very high probabilities of accuracy (i.e., 90-99% range) for both data sets, but the majority of the predictions for REACHAcross™ were associated with 50-59% probability of accuracy for both data sets. The highest probability of accuracy for a prediction from PredSkin fell in the 80-89% range for both data sets, but the majority of the predictions were associated with a probability of accuracy between 70 and 79%.
While PredSkin and REACHAcross™ offer a probability of accuracy feature, TIMES-SS and QSAR Toolbox both include a metabolism feature in their design. QSAR Toolbox considers metabolism in its automated workflow skin sensitization assessment; however, in Version 4.1 of the QSAR Toolbox (the version used in this analysis), the automated workflow in batch mode does not identify whether the parent or metabolite is the chemical on which the skin sensitization alert is based. As a result, QSAR Toolbox was not included in this metabolism sub-analysis.
predicted to be a sensitizer. If the parent compound is not predicted to be a sensitizer but the metabolite is predicted to be a sensitizer, the overall sensitization status is predicted to be a sensitizer based on the predicted status of the metabolite. If neither the parent chemical nor the metabolite is predicted to be a skin sensitizer, then the overall prediction is "non-sensitizer".
The effect of metabolism on accuracy is shown in Figure 11. Overall model accuracy improved significantly in the Basketter et al. (2014) data set and modestly in the HSDB data set after incorporation of metabolism. This was particularly striking for sensitizers, as the accuracy increased by approximately 30% in both data sets. However, metabolism did not appear to affect accuracy for non-sensitizers in the Basketter et al. (2014) data set, and it decreased the accuracy of non-sensitizers in the HSDB data set. Likewise, total false predictions decreased substantially for the Basketter et al. (2014) data set and modestly for the HSDB data set after consideration of metabolism. Again, predic-
TIMES-SS is designed to perform optimally when metabolism is incorporated. TIMES-SS evaluates the skin sensitization potential of the parent compound and all potential metabolites. If the parent compound is a sensitizer, then the chemical is predict-  as Safety Data Sheets (SDS) or International Chemical Safety Cards (ICSC), or brief statements from summary texts, such as chemical or toxicity handbooks (e.g., Pohanish, 2012;Dreisbach, 1977;Hayes and Laws, 1991). Due to the limited details provided from studies/statements drawn from the Signs and Symptoms category, it often was not clear whether the primary skin sensitization data from these sources was from human or animal data. Some summaries also failed to provide sufficient details on the identity of the test substance, which made it unclear whether the test substance in the HSDB summaries was a pure chemical or a mixture, limiting the usefulness of the studies. Consequently, test substance data must be reported in a transparent manner to ensure accurate hazard assessment. Moreover, given that most chemicals in the data set had only one reliable study and/or hazard statement on which the skin sensitization status of the chemical was based, this highlights the need for increased data availability, especially for human data. Nonetheless, even limited data of lower quality can serve as an informative starting point for skin sensitization characterization, especially in a screening level hazard assessment. It is noteworthy that, with the exception of the Danish QSAR Database, the accuracies and balanced accuracies for all the models were greater for the Basketter et al. (2014) data set than for the HSDB data set. It is likely that many of the chemicals in the Basketter et al. (2014) data set are in the training set for many, if not all, of these models. The HSDB data set includes a wider range of chemical classes and, arguably, provides a more realistic picture of the chemical space encountered in hazard assessment.

Model performance
Model performance was evaluated on several metrics, including coverage, accuracy/balanced accuracy, sensitivity, specificity, and false prediction rates. Because one performance metric cannot necessarily be weighted higher than the others and because the models incorporate optimization features to different degrees, it is not possible to select one model that clearly outperforms the others. Still, some of the models offer advantages over the others.
First, even a simple structural alert model like Toxtree performs comparably to more complicated read-across and QSAR models. Compared to the balanced accuracies for the read-across and QSAR models, Toxtree ranked in the middle of all the other models' balanced accuracies for both the Basketter et al. (2014) and HSDB data sets. This implies that structural alerts are still useful in the evaluation of skin sensitization. However, they should be used with caution, as Toxtree had one of the highest percentages for false negatives for the Basketter et al. (2014) data set at 33% and had the highest number of false negatives for the HSDB data set at 45%, which suggests that it is not as conservative in its predictions as read-across and QSAR models. On the other hand, by virtue of being a structural alert system, Toxtree had complete coverage of both data sets and therefore serves as a useful first step in a screening hazard assessment to prioritize chemicals through either additional modeling and/or testing. In fact, the screening/confirmatory approach as well as combining structural alerts with read-across models and/or QSAR mod-tions for sensitizers improved most when metabolism was incorporated, as false predictions decreased around 30% for both data sets. However, the false predictions for the non-sensitizers were unaffected by metabolism in the Basketter et al. (2014) data set and increased for the HSDB data set.
An analysis to determine the effect of the sensitization status of Category 5 chemicals in the Basketter et al. (2014) data set on model accuracy was performed. Specifically, the sensitization status of the Category 5 chemicals was considered to be negative for the purposes of this part of the assessment (in previous parts of this assessment, they were categorized as positive for sensitization). The model accuracies and balanced accuracies were recalculated to determine how model accuracy was affected by this change in skin sensitization status. The results of the model accuracies and balanced accuracies with Category 5 as a positive skin sensitization category (i.e., the sensitization status originally assigned to the Basketter et al. (2014) data set) and with Category 5 as a negative skin sensitization category are compared in Table 6. As shown, some model accuracies and balanced accuracies improved while others declined; there was no clear pattern in the effect of Category 5 skin sensitization status in the model accuracies and balanced accuracies.

HSDB data set
While the curated data set from HSDB offers a fairly large and novel data set for human hazard assessment, it also illustrates the lack of depth and precision typically available to hazard assessors, especially for human data. To begin with, terminology is often confused: during the curation process, it was observed that the literature contains numerous other terms that are intended to represent skin sensitization, including contact dermatitis, contact eczema, eczematous dermatosis, and skin allergy. Skin sensitization is defined as a delayed hypersensitivity reaction (Roberts and Patlewicz, 2009), but some of these terms refer to conditions with immediate reactions. Further, some terms are umbrella terms meant to capture several skin conditions (e.g., contact dermatitis is an umbrella term that may refer to allergic or irritant contact dermatitis, and irritant contact dermatitis is not an allergic response (Cashman et al., 2012)) and may not specifically represent a delayed sensitization reaction. Therefore, the lack of precision in reporting human health effects both makes data curation more difficult and compromises data accuracy. One key necessity for improving our ability to predict human skin sensitizers will be a more concerted effort to use accurate and consistent terminology in reporting human skin sensitization data.
The most frequently occurring category was Human Exposure Studies; summaries from this category provided a relatively high level of detail, including test substance description and protocol information (e.g., number of test subjects, dose, etc.). While entries from this HSDB category comprised studies that were relatively well-described, many of the chemical entries for other high-volume categories, such as Signs and Symptoms, reported only hazard classifications from regulatory documents such vative, it has relatively high accuracy and very high sensitivity, making it useful for screening purposes to eliminate chemicals within the domain of applicability that are legitimate sensitization concerns.
While the models offer moderate overall balanced accuracy, some of the models excel at identifying either true sensitizers or true non-sensitizers, as supported by high sensitivity and high specificity. Specifically, PredSkin, QSAR Toolbox, and CAE-SAR each had a sensitivity of 80% or greater for both data sets, with PredSkin almost achieving perfect sensitivity for both data sets (98% for Basketter et al. (2014) and 100% for HSDB). Specificity was less consistent, but TIMES-SS achieved 100% specificity, while Toxtree, REACHAcross™, and Derek achieved 80% or greater for the Basketter et al. (2014) data set. The Danish QSAR Database achieved 84% specificity for the HSDB data set, but none of the remaining models attained high specificity for the HSDB data set, likely reflecting that this was a more diverse data set and consisted of fewer chemicals likely to be in the model training sets. In some cases, high sensitivity may come at the expense of specificity and vice versa; however, this may be a tradeoff hazard assessors are willing to make, especially if they are using a combination of models. This approach may be more informative than the traditional approach that relies only on overall accuracy of a single model. Verheyen et al. (2017) have demonstrated that the combination of two QSAR models improves prediction accuracy compared to the use of an individual model, but using three or more models reduces the prediction accuracy.
Of note, our analysis did not include a comparison of each of the two test sets (i.e., Basketter et al., 2014 and HSDB) to the training sets of each of the 8 in silico models. This was because the training sets of several of the models are not publicly available; therefore, the overlap of each of the test sets with the training sets from each of the 8 in silico models could not be assessed. Depending on the degree of overlap, the global performance metrics could be skewed in favor of those models that have a high degree of overlap between the test and training sets. Nevertheless, we feel that this analysis would not significantly alter our assessment, as hazard assessors carrying out screening assessments do not typically compare their test set with the training set before running an analysis with an in silico tool. Instead, assessors consider whether the chemical falls within the applicability domain or the probability of accuracy to determine whether the prediction is reliable.
Finally, although many models offer advantages over others, a disadvantage common to all models is the availability of a universal, accurate, and complete structural input parameter. CAS-RNs aim to fulfill this role; however, the CASRNs are often incorrectly used interchangeably, and many chemicals have more than one CASRN. SMILES strings come with their own set of issues, for example, some of the SMILES strings sourced from ChemIDplus Advanced were provided in a format to reflect their multi-positional substituents. While this is the accurate SMILES string presentation for these chemicals, this unspecified or undefined format creates two major problems: first, many models simply cannot read this unspecified format, and second, other models force a prediction based on the unspecified SMILES string, which can sometimes produce inaccurate predictions. Ad-els has been evaluated and proposed in other work (Alves et al., 2016b). Similarly, Derek, another structural alert-based model, which pairs structural alerts with expert rules, illustrates the power of a structural-alert model, as supported by its high coverage and low incidence of false predictions. Specifically, this model has the potential to resolve false predictions by applying expert rules that address common structural alerts which trigger false predictions. This pairing allows for increased accuracy when compared to a basic structural alert model.
Additionally, TIMES-SS saw considerable improvement in prediction accuracy for sensitizers and a decrease in the false predictions in both the Basketter et al. (2014) and HSDB data sets when metabolism was incorporated, emphasizing the relevance of metabolism to skin sensitization (Smith Pease et al., 2003). However, there was an increase in the false positives rate for non-sensitizers in the HSDB with predicted metabolism. Thus, the effects of metabolism predictions need to be better understood for reaching conclusions for non-sensitizers, perhaps in combination with a better understanding of skin penetration. Other models that do not explicitly address metabolism through skin metabolism simulators, as TIMES-SS does, still have relatively high accuracy, although this may be because metabolism is implicitly included through the inclusion of chemicals that require metabolic activation in the training data sets and classification based on in vivo studies where metabolism naturally occurs. However, metabolism should be implemented with caution to avoid overly-conservative hazard assessments. Therefore, if metabolism is predicted to drive the skin sensitization status of a chemical, it should be flagged for further review to determine whether the metabolite is both feasible and relevant.
Very few in silico tools offer a probability of accuracy, but this feature can improve model performance if employed correctly. Predskin and REACHAcross™ both offer this "optimization" feature, but it comes with strengths and weaknesses. PredSkin and REACHAcross™ both make predictions where data availability is minimal; however, this results in some predictions with low probability of accuracy, with some predictions returning a probability of accuracy as low as 50%. Because a probability of accuracy this low is not sufficient to confidently determine the sensitization status of a chemical, it is a near impossible case to make for regulatory acceptance. However, the transparency provided by the probability of accuracy is useful to the hazard assessor: It informs the confidence of the prediction, allowing the hazard assessor to more accurately determine whether the prediction is worth incorporating into a chemical hazard assessment. It should be noted that Derek also offers a probability of accuracy metric (likelihood levels); however, as it is a qualitative metric, it was not included in this analysis.
QSAR Toolbox's accuracy highlights the importance of two factors in the evaluation of skin sensitization: selection of good analogs and consideration of mechanistic similarity. While coverage was low for this model, this was driven by the fact that this model does not force predictions when good analogs are unavailable. The analogs in this model are selected by first considering mechanistic similarity as opposed to the traditional approach of choosing surrogates based on structural similarity. While the model has low coverage and is undoubtedly conser-for prioritization and screening level assessments, it is also true that there remains work to be done.
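As an illustration of the performance metrics discussed in this evaluation (coverage, accuracy, balanced accuracy, sensitivity, and specificity), the following minimal Python sketch computes them from binary human reference labels and binary model predictions. It uses the standard confusion-matrix definitions and is not the authors' code; the function name `performance_metrics` and the toy data are hypothetical. It also assumes that accuracy-type metrics are computed only over chemicals for which a model returned a prediction.

```python
import math
from typing import Dict, Optional, Sequence


def performance_metrics(
    truth: Sequence[bool],
    predicted: Sequence[Optional[bool]],
) -> Dict[str, float]:
    """Compute screening metrics for one model against one data set.

    truth[i]     -- human reference status (True = sensitizer).
    predicted[i] -- model call, or None when the model returned no
                    prediction (e.g., outside its applicability domain).
    Assumption: chemicals without a prediction count against coverage
    only and are excluded from the other metrics.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fp = tn = fn = 0
    for actual, call in zip(truth, predicted):
        if call is None:
            continue  # not covered by the model
        if actual and call:
            tp += 1
        elif actual and not call:
            fn += 1
        elif not actual and call:
            fp += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a class is empty.
        return numerator / denominator if denominator else math.nan

    covered = tp + fp + tn + fn
    sensitivity = ratio(tp, tp + fn)   # true sensitizers correctly flagged
    specificity = ratio(tn, tn + fp)   # true non-sensitizers correctly cleared
    return {
        "coverage": ratio(covered, len(truth)),
        "accuracy": ratio(tp + tn, covered),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Toy data: three sensitizers, three non-sensitizers, one missing call.
    human = [True, True, True, False, False, False]
    model = [True, True, False, False, True, None]
    for name, value in performance_metrics(human, model).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it less sensitive than plain accuracy to the ratio of sensitizers to non-sensitizers in a data set.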