Improved Defined Approaches for Predicting Skin Sensitization Hazard and Potency in Humans

Since the EU banned animal testing for cosmetic products and ingredients in 2013, many defined approaches (DA) for skin sensitization assessment have been developed. Machine learning models were shown to be effective in DAs, but the predictivity might be affected by data imbalance (i.e., more sensitizers than non-sensitizers) and limited information in the databases. To improve the predictivity of DAs, we attempted to apply data-rebalancing ensemble learning (bagging with support vector machine (SVM)) and a novel and comprehensive Cosmetics Europe database. For predicting human hazard and three-class potency, 12 models were built for each using a training set of 96 substances and a test set of 32 substances from the database. The model that predicted hazard with the highest accuracy (90.63% for the test set and 88.54% for the training set, named hazard-DA) used SVM-bagging with combinations of all variables (V6), while the model that predicted potency with the highest accuracy (68.75% for the test set and 82.29% for the training set, named potency-DA) used SVM alone. Both DAs showed better performance than LLNA and other machine learning-based DAs, and the potency-DA provided more in-depth assessment. These findings indicate that SVM-bagging-based DAs provide enhanced predictivity for hazard assessment by further data rebalancing. Meanwhile, the effect of imbalanced data might be offset by more detailed categorization of sensitizers for potency assessment, thus SVM-based DA without bagging could provide sufficient predictivity. The improved DAs in this study could be promising tools for skin sensitization assessment without animal testing. This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is appropriately cited.


Introduction
Allergic contact dermatitis (ACD) is a clinically relevant condition induced by contact with an allergen, and 15%-20% of people suffer from it at some point in their life (Thyssen et al. 2007). To assess skin sensitization, animal methods such as the local lymph node assay (LLNA) and the guinea-pig maximization test (GPMT) have been adopted as the "benchmark" methods in many countries (Daniel et al. 2018). However, in the context of growing concern about animal welfare, the EU banned animal testing for cosmetic products and their ingredients in 2013 (EU 2009). Thus, there is now an urgent need to develop non-animal methods that can fully reflect skin sensitization.
Substantial progress has been made in this regard by developing in vitro assays addressing different key events (KEs) in the skin sensitization adverse outcome pathway (AOP) (OECD 2015; OECD 2018a; OECD 2018b). The KEs described in the AOP include the molecular initiating event (covalent binding to skin proteins) and the cellular responses to sensitizers (activation of keratinocytes and dendritic cells). To evaluate these KEs, in vitro methods such as the direct peptide reactivity assay (DPRA), KeratinoSens™, and the human cell line activation test (h-CLAT) have been developed and accepted by the Organisation for Economic Co-operation and Development (OECD) (Emter et al. 2010; Gerberick et al. 2004; Sakaguchi et al. 2006). However, no single in vitro method can comprehensively represent the complexity of the processes involved in skin sensitization (Osamu et al. 2015). Therefore, as components of integrated approaches to testing and assessment (IATA), many defined approaches (DAs) have been developed; they combine the complementary characteristics of the in vitro methods, further take physicochemical properties and structure into consideration, and have greatly improved the predictivity of skin sensitization assessment (Worth and Patlewicz 2016).
Many algorithms can serve as the data interpretation procedure of a defined approach, ranging from simple rule-based strategies to machine learning models. Because machine learning models are data driven, their parameters can improve as the amount of input data increases, so quantitative prediction of potency can be achieved in machine learning-based DAs but not in simple rule-based DAs (Jaworska et al. 2015; Kleinstreuer et al. 2018). Among the available machine learning models, the support vector machine (SVM) showed higher predictivity for human hazard, with an accuracy of up to 81.70%, than the artificial neural network (ANN) and the Bayesian network (BN) (Kleinstreuer et al. 2018), and was thus considered a promising model.
However, the databases adopted in the available DAs usually include many more sensitizers than non-sensitizers (i.e., the data are imbalanced) (Jaworska et al., 2015; Hoffmann et al., 2018), which might affect the predictivity of DAs. Previous work has shown that the majority class in a database is often correctly predicted by SVM, whereas the minority classes tend to be misclassified (López et al., 2013; Li et al., 2016). Fortunately, many studies in other fields have shown that combining bagging with SVM can significantly improve predictivity with imbalanced data (Mordelet and Vert 2014; Yu et al. 2018; Zararsiz et al. 2012). The bagging method can generate new, balanced training subsets suitable for building SVM models by sampling with replacement from a given imbalanced training set (Guo et al., 2017). Therefore, the combination of SVM and bagging might be an effective option for developing DAs to assess skin sensitization.
Owing to the complexity of skin sensitization, DA development requires a database that contains comprehensive data from test methods together with animal and human reference data. Previous databases lacked human potency data and relied instead on LLNA EC3 or LLNA binary data (Hirota et al. 2015; Strickland et al. 2016). However, animal data are not a precise prediction target, as some substances with high potency in humans, such as 6-Methyl-3,5-heptadien-2-one and tea leaf absolute, were misclassified as non-sensitizers in the LLNA (Api et al., 2017; Hoffmann et al., 2018). In addition, previous databases contained few data from the in vitro assays newly accepted by the OECD and a limited number of cosmetic substances, which might reduce the predictivity of skin sensitization assessment for cosmetic substances. Thus, to improve DA predictivity, the Cosmetics Europe database was established by compiling existing and newly generated test method data together with LLNA and human reference data for 128 substances (Hoffmann et al., 2018). This new database provides more detailed information on skin sensitization, including human potency categories 1-6, LLNA data, continuous data from in vitro assays, and physicochemical properties of chemicals used in cosmetics. The human potency categories 1-6 (1, 2: high potency; 3, 4: low potency; 5, 6: non-sensitizer), which are based on clinical patch test data with specific cutoffs for exposure and incidence, are now considered better potency targets for assessment (Api et al. 2017; Basketter et al. 2014). Moreover, continuous data from in vitro assays can produce DAs with higher performance than binary data (Zang et al. 2017). Thus, developing DAs that predict human potency using the Cosmetics Europe database could enable a closer assessment of the skin sensitization of chemicals.
Against this background, with the aim of further improving predictive performance for human hazard and potency, we developed novel DAs by applying ensemble learning (SVM-bagging) to the newly established Cosmetics Europe database. All substances were divided into a training set of 96 substances and a test set of 32 substances. The predictivity of the novel DAs was validated using the test set and compared with that of the LLNA and published DAs (Kleinstreuer et al., 2018).

Materials and Methods

Substance database
We used the Cosmetics Europe database, which excludes metals and other substances that are used less frequently in cosmetics.
For the 128 substances in the database, data on human potency class, DPRA, h-CLAT, KeratinoSens™, and six physicochemical properties relevant to skin exposure and penetration were collected. The six physicochemical properties were the octanol:water partition coefficient, water solubility, vapor pressure, melting point, boiling point and molecular weight (see supplementary file 2) (Hoffmann et al. 2018).

Characterization of the substances
Substances were assigned to six human potency classes in the database according to data from human maximization tests (HMT), human repeat insult patch tests (HRIPT) and diagnostic patch tests (DPT), and were further categorized as follows (Hoffmann et al., 2018; Kleinstreuer et al., 2018). Classes 5 and 6 were considered non-sensitizers because a large dose is required to trigger skin sensitization. Classes 3 and 4 were low-potency sensitizers, and classes 1 and 2 were high-potency sensitizers. In this study, non-sensitizers, low-potency sensitizers and high-potency sensitizers were coded as 0, 1 and 2, respectively. Of the 128 substances, 68.75% (88/128) were classified as positive for sensitization in humans and 31.25% (40/128) were classified as negative. Among the sensitizers, 58 substances were low-potency sensitizers and the remaining 30 were high-potency sensitizers. Skin sensitizers may require oxidation (pre-haptens) and/or metabolism (pro-haptens) in order to produce a skin sensitization reaction. Among the 88 sensitizers, 10 were pre-haptens, 3 were pro-haptens and 9 were pre/pro-haptens.
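The class coding described above can be written down as a small lookup. The following is a minimal sketch; the function names are hypothetical and only the class groupings (1, 2: high potency; 3, 4: low potency; 5, 6: non-sensitizer) and the 0/1/2 coding come from the text.

```python
# Hypothetical helper names; only the class groupings and the 0/1/2 coding
# are taken from the text.
def potency_label(human_class: int) -> int:
    """Map a six-level human potency class (1-6) to the three-level label."""
    if human_class in (1, 2):
        return 2  # high-potency sensitizer
    if human_class in (3, 4):
        return 1  # low-potency sensitizer
    if human_class in (5, 6):
        return 0  # non-sensitizer
    raise ValueError(f"unexpected human potency class: {human_class}")


def hazard_label(human_class: int) -> int:
    """Collapse the three-level label to a binary hazard call (1 = sensitizer)."""
    return 1 if potency_label(human_class) > 0 else 0


classes = [1, 2, 3, 4, 5, 6]
print([potency_label(c) for c in classes])  # [2, 2, 1, 1, 0, 0]
print([hazard_label(c) for c in classes])   # [1, 1, 1, 1, 0, 0]
```

The same three-level label can then serve as the target for potency models, and its binary collapse as the target for hazard models.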

Model variables

DPRA
DPRA is an in chemico test that assesses the ability of a substance to form a hapten-protein complex, which is KE1 in the skin sensitization AOP. It measures the reactivity of a test substance towards two model synthetic peptides, one containing lysine (mixed at a ratio of 1:50 with the test substance) and the other containing cysteine (mixed at a ratio of 1:10 with the test substance). The depletion of the peptides after a 24-h incubation with the test substance is measured using high performance liquid chromatography. The DPRA data comprised average cysteine peptide depletion (Cys), average lysine peptide depletion (Lys), average depletion of the cysteine and lysine peptides (Avg.Lys.Cys) and a sensitizer/non-sensitizer outcome based on a decision tree. Cys, Lys and Avg.Lys.Cys were used as model variables and were available for 126 of the substances. Negative peptide depletion values were set to zero. Because Cys and Lys were not available for dextran and 2-hexylidene cyclopentanone, the mean Cys and Lys values of the other 126 substances were used for these two substances.
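The two preprocessing steps for the depletion values (setting negative values to zero and filling missing values with the mean of the available ones) can be sketched as follows. This is an illustration only: the function name and the example numbers are hypothetical, and the order of the two steps (clipping before averaging) is an assumption, since the text does not state it.

```python
from statistics import mean
from typing import List, Optional


def clean_depletion(values: List[Optional[float]]) -> List[float]:
    """Clip negative depletion to zero, then fill missing entries with the mean.

    Assumption: clipping is applied before the mean is computed.
    """
    clipped = [None if v is None else max(v, 0.0) for v in values]
    observed = [v for v in clipped if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in clipped]


# Hypothetical depletion percentages; None marks a missing measurement.
cys = [95.0, -2.5, None, 10.0, 17.5]
print(clean_depletion(cys))  # [95.0, 0.0, 30.625, 10.0, 17.5]
```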

KeratinoSens™
KeratinoSens™ assesses the ability of substances to activate cytokines and induce the expression of cytoprotective genes in keratinocytes, based on activation of the Keap1-Nrf2 pathway, which addresses KE2 (keratinocyte activation) in the AOP. The assay measures antioxidant response element (ARE)-induced luciferase expression in a stable human keratinocyte cell line. Luciferase expression and cell viability are measured after 48 h of incubation with the test substance. The test substance is classified as a sensitizer if luciferase expression is induced more than 1.5-fold relative to the vehicle control and cell viability is over 70%. The EC1.5 value, the concentration at which luciferase expression is induced 1.5-fold, was used as the model variable and was available for all 128 substances.

h-CLAT
h-CLAT assesses the ability of substances to activate and mobilize dendritic cells in the skin, which is KE3 in the AOP. The assay measures changes in the expression of the CD86 and CD54 surface markers in human THP-1 cells by flow cytometry. Substances are classified as sensitizers if the relative fluorescence intensity is at least 50% above the baseline level for CD86 or at least 200% of the baseline level for CD54, at concentrations where cell viability is ≥50% of the control, in at least two of three independent tests. As many as 71 values for CD54 EC200 and 45 values for CD86 EC150 were missing from the database. Thus, the binary h-CLAT outcome was used as the model variable; it was available for 127 of the substances, and the missing h-CLAT outcome of 2-hexylidene cyclopentanone was imputed as positive, in accordance with its DPRA and KeratinoSens™ results.

Physicochemical properties
We used the octanol:water partition coefficient, water solubility, vapor pressure, molecular weight, melting point and boiling point as variables; these data were available for 122 substances. For the 6 substances with missing values, the mean value of the corresponding property across the 122 substances was used.

Data processing

Selection of training set and test set
The 128 substances in the database were divided into a training set and an external test set in a ratio of 75% to 25%. Based on the human potency classes assigned in the database, all substances were first classified as sensitizers or non-sensitizers, after which the sensitizers were classified as having high or low potency.
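One way to obtain a 75%/25% split that preserves the class proportions is a per-class (stratified) random split. The sketch below is an assumption about how such a split could be done, not the procedure the authors report; the function name and seed are hypothetical, and only the class sizes (40 non-sensitizers, 58 low-potency and 30 high-potency sensitizers) come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class at test_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Python's round() rounds exact halves to the nearest even integer.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes from the text: 0 = non-sensitizer, 1 = low potency, 2 = high potency.
labels = [0] * 40 + [1] * 58 + [2] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 96 32
```

With these class sizes the per-class test counts come out as 10, 14 and 8, which matches the test-set composition given in the table notes; this agreement does not show that the authors used this procedure.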

Building predictive models
We used the training set of 96 substances to build models for predicting human outcomes, both with the SVM model alone and with the SVM as the base model of the bagging method. Bagging is a resampling technique that generates multiple training subsets by sampling with replacement from the original training set. Model building was implemented using the scikit-learn package in Python (Pedregosa et al. 2011). Prediction models were first developed using SVM with each of six variable sets, which were based on different combinations of the 11 variables collected, and were then integrated with the bagging method to improve performance. Thus, 12 models each were built for predicting human hazard and potency.
DPRA Cys: depletion of cysteine peptide in the direct peptide reactivity assay; DPRA Lys: depletion of lysine peptide in the direct peptide reactivity assay; Avg.Cys.Lys: average depletion of cysteine and lysine peptides; h-CLAT Binary Result: the binary result of the human cell line activation test; Log S: log water solubility; Log P: log octanol:water partition coefficient; Log VP: log vapor pressure. a: X denotes the input variables included in each variable set.
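A single SVM and an SVM-based bagging ensemble can be set up in scikit-learn roughly as follows. This is a minimal sketch on synthetic data, not the authors' configuration: the kernel, the feature scaling, the number of ensemble members and the random seeds are all illustrative assumptions, and the data are a stand-in for the 96-substance training set. Note that `BaggingClassifier` draws ordinary bootstrap samples; any explicit class rebalancing of the subsets would need an additional step that the text does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a 96-substance training set with 11 variables and
# roughly 30% negatives; this is NOT the Cosmetics Europe data.
X, y = make_classification(n_samples=96, n_features=11, n_informative=5,
                           weights=[0.3, 0.7], random_state=0)

# Single SVM; scaling and the RBF kernel are illustrative choices.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X, y)

# Bagging with the SVM pipeline as base model: each of the 25 members is
# fit on a bootstrap sample (drawn with replacement) of the training set,
# and the ensemble combines the members' predictions.
svm_bagging = BaggingClassifier(svm, n_estimators=25, bootstrap=True,
                                random_state=0)
svm_bagging.fit(X, y)

print(svm.score(X, y), svm_bagging.score(X, y))
```

The base model is passed positionally because the keyword for it differs between scikit-learn versions.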

Evaluation of model performance
Model performance for hazard assessment was evaluated by calculating the sensitivity, specificity and accuracy for predicting human outcomes using Cooper statistics (Strickland et al. 2016). The best model for hazard assessment was selected based on the accuracy and the mean of sensitivity and specificity in both the test set and the training set. Model performance for potency assessment was evaluated by calculating the accuracy, over-predicted rate and under-predicted rate (Kleinstreuer et al. 2018). The best model for potency assessment was selected based on the accuracy and the mean of the over-predicted and under-predicted rates in both the test set and the training set. Any candidate model that incorrectly predicted high-potency substances as non-sensitizers could not qualify as the best model.
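Cooper statistics are the usual confusion-matrix ratios, so the hazard metrics can be sketched in a few lines. The function name is hypothetical. The example counts are those implied by the test-set results reported for hazard-DA later in the paper (22 sensitizers with 2 false negatives, 10 non-sensitizers with 1 false positive).

```python
def cooper_statistics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, accuracy and their mean from confusion counts."""
    sensitivity = tp / (tp + fn)   # sensitizers correctly identified
    specificity = tn / (tn + fp)   # non-sensitizers correctly identified
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mean_sens_spec": (sensitivity + specificity) / 2,
    }


# Counts implied by the hazard-DA test-set results (32 substances).
stats = cooper_statistics(tp=20, fn=2, tn=9, fp=1)
for name, value in stats.items():
    print(f"{name}: {100 * value:.1f}%")
```

These counts give an accuracy of 29/32 (about 90.6%) and a specificity of 9/10 (90%), in line with the values reported for hazard-DA.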

Predictivity of the machine learning models
The performance of the 12 models for predicting human hazard is shown in Table 2. The accuracy for the training set ranged from 76.04% to 98.48%, while that for the test set ranged from 50.00% to 93.75%. Overall, bagging improved accuracy mainly by improving specificity, which increased in either the training set or the test set for 3 of the 6 SVM models (i.e., SVM V1, V5 and V6). In addition, the difference in specificity between the training set and the test set tended to decrease as more input variables were used.
The best model for assessing human hazard was SVM-bagging-V6 (the bagging model using the V6 variable set with SVM as the base model), which did not misclassify any high-potency substance as a non-sensitizer. For the test set, it had 90.63% accuracy and a mean of specificity and sensitivity of 90.46%. For the training set, it had 88.54% accuracy and a mean of specificity and sensitivity of 84.39%. SVM-bagging-V6 is referred to as hazard-DA for further evaluation.

Substances misclassified by hazard-DA in the training set
Hazard-DA misclassified 11 substances in the training set, with 3 false negatives and 8 false positives (Tab. 3). The false-negative substances were isocyclogeraniol, benzyl alcohol and benzyl cinnamate, all of which were low-potency substances and none of which were pro/pre-haptens. h-CLAT was the only in vitro method that correctly identified isocyclogeraniol and benzyl alcohol as sensitizers, and KeratinoSens™ was the only in vitro method that correctly identified benzyl cinnamate as a sensitizer.

Substances misclassified by hazard-DA in test set
Hazard-DA misclassified 3 substances in the test set, with 2 false negatives and 1 false positive (Tab. 4). The false-negative substances were benzoyl peroxide and resorcinol, both of which were low-potency substances. Resorcinol was the only misclassified pro-hapten. DPRA was the only in vitro method that correctly identified benzoyl peroxide as a sensitizer, and h-CLAT was the only in vitro method that correctly identified resorcinol as a sensitizer.

Performance of machine learning models for predicting human potency

Predictivity of the machine learning models
The performance of the 12 models for predicting human potency is shown in Table 5. The accuracy for the training set ranged from 63.54% to 84.38%, while that for the test set ranged from 28.13% to 71.88%. Overall, the bagging method improved the accuracy of two SVM models (i.e., SVM V1 and V2) in either the training set or the test set, but did not improve the remaining SVM models. SVM-V5 was not considered to be improved by bagging because its accuracy increased in the training set but decreased in the test set.
SVM-V5 had the highest accuracy in the test set but misclassified two high-potency sensitizers as non-sensitizers. Like SVM-V5, SVM-V5-bagging also misclassified two high-potency sensitizers as non-sensitizers. For the V6 variable set, bagging did not improve the SVM model, and the SVM model alone showed slightly higher accuracy. Therefore, the best model for human potency was SVM-V6 (the SVM model using the V6 variable set), which had 68.75% accuracy, a 21.88% over-predicted rate and a 9.38% under-predicted rate for the test set, and 82.29% accuracy, a 10.42% over-predicted rate and a 7.29% under-predicted rate for the training set. The SVM-V6 model is referred to as potency-DA for further evaluation.

Substances misclassified by the best model in the training set
Potency-DA misclassified 17 substances in the training set, with 7 under-predicted substances and 10 over-predicted ones (Tab. 6). None of the high-potency substances was misclassified as a non-sensitizer, and none of the non-sensitizers was misclassified as a high-potency substance.
Note: "2" means high potency, "1" means low potency, "0" means non-sensitizer.

Substances misclassified by the best model in the test set
Potency-DA misclassified 10 substances in the test set, with 3 under-predicted substances and 7 over-predicted substances (Tab. 7). None of the high-potency substances was misclassified as a non-sensitizer, and none of the non-sensitizers was misclassified as a high-potency substance.
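The potency metrics can be reproduced from such counts. In the sketch below, the over-predicted and under-predicted rates are taken relative to all substances in the set, which is the reading consistent with the reported test-set values (7/32 = 21.88% and 3/32 = 9.38%). The function name is hypothetical. The example vectors are also hypothetical: only the totals (32 substances, 7 over-predicted, 3 under-predicted, no confusion between high potency and non-sensitizer) are taken from the text, while the placement of individual errors is invented for illustration.

```python
def potency_metrics(y_true, y_pred):
    """Accuracy, over-/under-predicted rates and the high-as-non-sensitizer count.

    Labels: 0 = non-sensitizer, 1 = low potency, 2 = high potency.
    """
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        "over_predicted": sum(p > t for t, p in pairs) / n,
        "under_predicted": sum(p < t for t, p in pairs) / n,
        # A model with any such error could not qualify as the best model.
        "high_as_non_sensitizer": sum(t == 2 and p == 0 for t, p in pairs),
    }


# Hypothetical arrangement of errors matching the reported test-set totals.
y_true = [0] * 10 + [1] * 14 + [2] * 8
y_pred = [1] * 4 + [0] * 6 + [2] * 3 + [0] * 2 + [1] * 9 + [1] * 1 + [2] * 7
metrics = potency_metrics(y_true, y_pred)
print(metrics)
```

For these vectors the accuracy is 22/32 (0.6875), the over-predicted rate 7/32 and the under-predicted rate 3/32.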

Discussion
Given the bans on animal testing of cosmetic products and their ingredients in many parts of the world, feasible and accurate defined approaches are urgently needed as alternatives to such testing. Here, novel DAs for predicting human hazard and potency, hazard-DA and potency-DA, were developed. The hazard-DA was generated by combining the bagging method with the SVM model, while the potency-DA was generated by the SVM model alone. Both hazard-DA and potency-DA showed higher predictivity than the other machine learning-based DAs and the LLNA (Kleinstreuer et al. 2018) (Tab. 8 and 9). The bagging method improved the accuracy of three SVM models (i.e., SVM V1, V5 and V6) by improving their predictivity for non-sensitizers and thereby their specificity for hazard assessment. The hazard-DA (SVM-bagging-V6) reached an accuracy of 90.63% in the test set, owing to an increase in specificity from 80.00% to 90.00%, whereas the accuracy of the corresponding single SVM model was only 87.50%. Furthermore, hazard-DA showed higher predictivity than the other validated machine learning-based DAs and the LLNA. This indicates that bagging indeed helped to rebalance the imbalanced hazard data and could thus improve model performance, which is consistent with previous research in another field (Yu et al. 2018). However, for assessment of the three-class potency, the bagging method improved the accuracy of only two SVM models (i.e., SVM V1 and V2) in either the training set or the test set, and did not improve the remaining SVM models, which had more input variables. A potential explanation for this unexpected finding is that the effect of imbalanced data on potency prediction was offset by the more detailed categorization of sensitizers in the database, so that the bagging method offered no benefit for those models.
Regarding the predictivity of potency-DA, its accuracy of 68.75% in the test set was higher than that of the LLNA. According to the LLNA data, one human non-sensitizer (benzalkonium chloride) was classified as a high-potency substance, and two human high-potency sensitizers (6-Methyl-3,5-heptadien-2-one and tea leaf absolute) were classified as non-sensitizers. This might have resulted from species differences in skin sensitization (Natsch and Emter 2015). By contrast, potency-DA did not misclassify any high-potency substance as a non-sensitizer or any non-sensitizer as a high-potency substance. Consequently, potency-DA, which uses the Cosmetics Europe database including data from human cell lines (KeratinoSens™ and h-CLAT), could assess sensitization more closely than animal testing.
At present, an increasing number of in vitro assays for skin sensitization are being developed, so the robustness of such assays might become an important factor in DA performance. For example, KeratinoSens™, which addresses KE2, showed good predictivity for hazard assessment, but its robustness may be affected by random integration (i.e., the plasmids carrying the ARE-luciferase reporter cassette are inserted into the genome at random) or by the omission of other factors that regulate skin sensitization (Lai et al. 2016; Uemura et al. 2016; Soldner et al. 2016). By contrast, the EndoSens assay, which our group recently developed by precise knock-in of a reporter gene into the HMOX1 expression cassette, was more robust and could be a better choice as an input variable for DAs (Zhong et al., 2018). In addition, other in vitro assays addressing KE3 (i.e., the IL-8 Luc assay and U-SENS) have also been validated and accepted by the OECD (OECD 2018a). Thus, there is still great potential for optimizing DA performance by using data from more robust in vitro assays addressing different KEs of the AOP when more testing data become available.
In conclusion, the hazard-DA and potency-DA developed in this study are promising DAs for predicting human hazard and potency. Further work should focus on testing the models with an expanded set of substances and on applying data from more validated and regulatorily accepted assays to develop a more accurate DA for potency assessment.

Supplementary file
The supplementary file 2 contains the data from Cosmetics Europe database and the prediction results of the models in this study.

Tab. 1: Six variable sets used to build models for predicting human hazard and potency
Table 1 defines the six variable sets. The numbers in the column headings represent Variable Sets 1 through 6, and the Xs in each column indicate which data were included in each variable set.

Tab. 2: Performance of 12 models for predicting human hazard
The training set of 96 substances contains 30 non-sensitizers and 66 sensitizers. The test set of 32 substances contains 10 non-sensitizers and 22 sensitizers.
a : SVM, support vector machine; SVM-Bagging, bagging method with SVM as its base model. b : Variables in each variable set are shown in Tab. 1. c :

Tab. 5: Performance of 12 models for predicting human potency
The training set of 96 substances contains 30 non-sensitizers, 44 low-potency sensitizers and 22 high-potency sensitizers. The test set of 32 substances contains 10 non-sensitizers, 14 low-potency sensitizers and 8 high-potency sensitizers.
a : SVM, support vector machine; SVM-Bagging, bagging method with SVM as its base model. b : Variables in each variable set are shown in Tab. 1. c :

Tab. 8: Performance of DAs and LLNA for predicting human hazard a
Kleinstreuer et al., 2018