Integrated Skin Sensitization Assessment Based on OECD Methods (III): Adding Human Data to the Potency Assessment



Introduction
Development of predictive tools for the skin sensitization endpoint has been a key focus in the area of new approach methodologies (NAM) to replace animal tests, culminating in the publication by the OECD of test guidelines (TGs) based on three key events of the sensitization adverse outcome pathway (AOP) (OECD, 2014). So far, these TGs cover nine different test protocols (Natsch and Gerberick, 2022a; OECD, 2018a,b, 2023). This was followed by the first OECD TG implementing defined approaches (DA), which combine several test methods to an integrated assessment of the skin sensitization endpoint (OECD, 2021a). These DA use the methods that were first implemented as OECD TGs, namely the direct peptide reactivity assay (DPRA) (Gerberick et al., 2004), the KeratinoSens assay (KS) (Emter et al., 2010), and the human cell line activation assay (h-CLAT) (Sakaguchi et al., 2006). These individual methods and the DA primarily yield a binary output to determine the sensitization hazard of chemicals. In addition, the DA "integrated testing strategy

Skin sensitizer potency assessment based on new approach methodologies is key to deriving a point of departure (PoD) for risk assessment. Regression models to predict a PoD based on OECD-validated in vitro tests and trained on local lymph node assay (LLNA) data were previously presented, and results from human tests were recently compiled. To integrate both data sources, the Reference Chemical Potency List (RCPL), which provides potency values (PV) for 33 chemicals integrating LLNA and human data in a structured weight-of-evidence approach, was developed. When calculating regression models vs PV or LLNA data, different weights for the input parameters were noted. As the RCPL is based on too few chemicals to train robust statistical models, the list of human data was extended to a larger set of PV (n = 139) with associated in vitro data. This database was used to retrain the regression models and to compare regression models trained vs (i) LLNA, (ii) PV or (iii) human DSA04 values. Using the PV as a target, predictive models of similar predictivity to the LLNA-based models were obtained, which mainly differ in a lesser weight of cytotoxicity and a higher weight of cell activation and reactivity parameters. Analysis of the human DSA04 dataset indicates a similar pattern but also shows that the human dataset is too small and biased to be a key dataset for potency prediction. Hence, an enlarged set of PV values appears to be a complementary tool to train predictive models next to an LLNA-only database.

2010), and they have the advantage that they also truly measure elicitation of skin sensitization after an induction phase.
Recently, the Reference Chemical Potency List (RCPL), which presents potency values (PV) for 33 chemicals in an attempt to integrate both LLNA and human data in a structured WoE approach following discrete workflows, was published (Irizar et al., 2022). For studies with positive cases of sensitization, the human data were expressed as DSA04 values, i.e., the estimated dose per surface area (DSA, expressed in μg/cm²) of a chemical that would be required to induce skin sensitization in 4% of exposed subjects under the conditions of an HRIPT or HMT. The DSA04 value was selected in preference to any other DSA value as it was found to correlate best with LLNA EC3 values on a set of 20 chemicals and thus should be a metric directly comparable to the EC3 expressed as dose per area. Although the evidence to give preference to the DSA04 over, e.g., the DSA05 (as used in ICCVAM studies) (ICCVAM, 2011) or the DSA02 is limited, this work is based on the DSA04 to build on the analysis made for the RCPL. The aim of the RCPL is to serve as a benchmark to evaluate the ability of NAM (and quantitative DA based on NAM data) to rank the sensitizer potency of chemicals and to compare different approaches for PoD determination. However, the RCPL may be based on too few chemicals to train robust statistical models for potency determination, especially if they are based on multiple input parameters, because the higher the number of input variables, the greater the risk of obtaining over-fitted models. In addition, a key focus of the RCPL is to assess potency prediction for fragrance allergens, and thus the chemical selection was intentionally biased towards fragrance ingredients. Therefore, there is a risk of limited coverage of the chemical space for any model trained on this small dataset.
Here we first developed an extended WoE list of chemicals with DSA04 values or other robust sensitization evidence and derived a PV for these chemicals according to the RCPL workflows. Based on the same in vitro input data from the previous, LLNA-focused analysis (Natsch and Gerberick, 2022a,b), regression models were trained vs (i) LLNA EC3, (ii) PV values, and (iii) human DSA04 values in the different data subsets in order to compare predictivity for different in vivo targets.

In vitro database, data normalization and regression analysis
This paper is based on the database of in vitro and LLNA data published previously as ESM1 1 by Natsch and Gerberick

strated how this assessment can then be combined with the DA "2 out of 3" as implemented in TG 497 (Natsch and Gerberick, 2022b), as the same test methods used for the regulatory classification are applied for the PoD assessment. These models, similar to the detailed Bayesian models developed by Jaworska et al. (2013, 2015), were trained on EC3 values from the LLNA as the in vivo measure of sensitization potency. The EC3 gives the concentration for a three-fold stimulation of cell proliferation in the draining lymph nodes of the treated mouse ear.
The LLNA offers the most comprehensive source for in vivo potency data, and it is also the only data source with dose-response data, while tests in humans or on guinea pigs are mostly done at only one test concentration. Hence, the LLNA data appear most attractive to develop potency predictions in terms of data quality and size of the available database. However, the LLNA measures overall cell proliferation in the local lymph node as a proxy for the induction of skin sensitization and not the proliferation of hapten-specific T-cells, which would be the true endpoint we would like to predict. In addition, we might be faced with a species difference when using mice instead of human data. Nevertheless, this effect is generally considered minor for the sensitization endpoint as all key human sensitizers can sensitize laboratory animals. Furthermore, a correlation between human potency expressed as DSA05 and potency in mice expressed as LLNA EC3 with a slope close to one and a y-intercept close to zero has been reported, indicating that there is no fundamental species difference here (Api et al., 2015; Bil et al., 2017).
Next to LLNA data, results from human tests have been compiled, e.g., by the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) (ICCVAM, 2008, 2011) and the OECD expert group (OECD, 2021d). These data mostly originate from the human repeat insult patch test (HRIPT) (Politano and Api, 2008) or the human maximization test (HMT) (Kligman, 1966). The human tests truly measure sensitization in the species of interest, as they test for clinical hypersensitivity reactions after an induction phase with repeated exposure. However, as these human data lack dose-response information and come from much less standardized protocols, it remains an open question which data source is optimal to develop predictive models for sensitization potency.
Next to human and LLNA data, data from guinea pig maximization (GPMT) and Buehler tests are available. They were used as key information in the validation of the LLNA (Dean et al., 2001). While it is difficult to derive potency information from guinea pig data, they may be consulted in a weight of evidence (WoE) on whether a chemical is a relevant skin sensitizer (Emter et al

dictive approaches for their ability to predict and rank sensitizer potency of chemicals (which is the original goal of the RCPL), it is a relatively small dataset to (re)train any predictive model with multiple input parameters. Here we present an extended set of chemicals and apply the same principles used to derive the PV as were used for the RCPL.
For step (a), the list of chemicals from our prior publication (Natsch and Gerberick, 2022a) and the list of chemicals with LLNA, human, and in vitro data as compiled by the OECD expert group (OECD, 2021b,c,d) were used as a starting point. For each chemical it was assessed whether, next to the LLNA, relevant human data were available and, if not, whether there were GPMT data from the LLNA validation (Haneke et al., 2001), from the database compiled by ICCVAM (2008), or from the RIFM or ECHA database, the idea being that congruent LLNA and guinea pig data can contribute to the WoE for the sensitization status of a chemical. Chemicals were selected for inclusion in the assessment if (i) they were in the RCPL list; (ii) they were rated negative for human sensitization in the DASS database; (iii) they had negative LLNA data supported by negative human HRIPT/HMT data from studies at a high test concentration, other convincing negative human data or negative guinea pig data; (iv) a DSA04 could be derived from positive human data; (v) they had a positive LLNA study supported by either positive human data (positive HRIPT, positive HMT, frequent positive reactions in the clinic) or positive guinea pig studies (in the ICCVAM, RIFM or ECHA database); or (vi) they had multiple positive LLNA studies. Chemicals that had only a single LLNA study in the absence of other data or with divergent results from guinea pig studies were excluded. With these criteria, chemicals with a less certain status of sensitization hazard were excluded and the dataset was enriched for chemicals with available human data. The rationale for inclusion in the list is given separately for each chemical in the database in the supporting information (ESM1 2).
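The inclusion criteria (i) to (vi) can be read as a simple decision rule. The following is a minimal sketch only; the `Evidence` record and its field names are hypothetical and not part of the published database, and the exclusion of chemicals with divergent guinea pig results is represented by a single flag rather than by the expert review described in the text.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-chemical evidence summary (field names are illustrative)."""
    in_rcpl: bool = False                    # criterion (i)
    dass_human_negative: bool = False        # criterion (ii)
    llna_negative: bool = False              # criterion (iii), first part
    supporting_negative_data: bool = False   # negative human or guinea pig data
    dsa04_derivable: bool = False            # criterion (iv)
    n_positive_llna_studies: int = 0         # criteria (v) and (vi)
    supporting_positive_data: bool = False   # positive human or guinea pig data
    divergent_guinea_pig_data: bool = False  # exclusion flag for single-LLNA cases


def include_chemical(e: Evidence) -> bool:
    """Return True if any of the criteria (i)-(vi) is met."""
    if e.in_rcpl or e.dass_human_negative or e.dsa04_derivable:
        return True
    if e.llna_negative and e.supporting_negative_data:
        return True
    if e.n_positive_llna_studies >= 2:
        return True
    if e.n_positive_llna_studies == 1:
        # A single positive LLNA study needs congruent supporting data.
        return e.supporting_positive_data and not e.divergent_guinea_pig_data
    return False
```

A chemical with one positive LLNA study and no further evidence is rejected by this rule, while the same chemical with a positive HRIPT would be accepted.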
For step (b) - collection of human data - all individual human studies collected by the DASS expert group (OECD, 2021d) were used. In addition, the data collected from the ICCVAM database and summarized by Bil et al. (2017) were consulted, and a few data gaps could be filled with data from the RIFM database (e.g., in Na et al., 2021) not yet included in the DASS database (Tab. 1). For the data-rich chemicals in the RCPL list, preference was given to HRIPT studies, as these do not include irritation by SDS and often were conducted on a larger number of subjects. However, here all available positive studies in the DASS human database were used, and if multiple studies with positive cases of sensitization were available for one chemical, a DSA04 was derived for each study. The DSA04 was calculated from the DSA1+ value in the DASS database, which reports the extrapolated concentration for 1 positive reaction, but which is not normalized to the number of panelists. The DSA04 is an extrapolation from the frequency of positive reactions at the tested concentration to derive the concentration at which 4% reaction frequency would be expected. In the absence of knowledge on the true dose-response

(2022a), which contained also much of the data analysis conducted by the OECD Defined Approaches for Skin Sensitization (DASS) expert group (OECD, 2021b,c,d). The curated LLNA EC3 values compiled by the DASS expert group were used, except in those instances where additional data, e.g., from the European Chemicals Agency (ECHA), provided a more robust value or if no EC3 was available in the DASS database. For human data, the values recently curated for the RCPL were given preference, for a larger number of chemicals the data curated by the OECD expert groups were used, and in a few instances data gaps were filled with data from Bil et al. (2017) (which are based on the ICCVAM data compilation) or from the Research Institute for Fragrance Materials (RIFM) database. Thus, the majority of in vivo data have previously been carefully curated.
All data normalizations and all regression analyses were performed as described in detail before (Natsch and Gerberick, 2022a). In the previous work, three key and most predictive models were developed: (i) EQ1 combines KS and kDPRA data, (ii) EQ4 combines h-CLAT and kDPRA data, and (iii) EQ5 combines all three data sources. All models also contain a volatility parameter for highly volatile chemicals. For chemicals lacking Cys-peptide reactivity, an additional model based on KS and h-CLAT data (EQ6) was derived. In the current analysis, these four regression equations were recalculated for the different data subsets using the LLNA EC3, the PV or the DSA04 as in vivo target.
The input parameters derived from the different in vitro tests are Log kmax norm, the logarithmic normalized rate constant for peptide depletion in the kDPRA; Log EC1.5 norm, the concentration to induce the luciferase gene in KS 1.5-fold; Log IC50 norm, the concentration to reduce cell viability by 50% in the KS; Log MIT norm, the normalized lower value of the concentration to induce the cell surface markers CD54 by 200% or CD86 by 150% in the h-CLAT; Log CV75 norm, the concentration to reduce cell viability by 25% in the h-CLAT; and Log VP norm, describing volatility of the most volatile chemicals. All concentrations are expressed in molarity, and all logarithmic values were linearly normalized to 0 for chemicals with no reactivity, cell activation or cytotoxicity as described before (Natsch and Gerberick, 2022a) and as reported in the previously published database. The normalized in vitro data are reiterated in ESM1 2; for source data consult the original dataset.

An extended dataset with potency values and DSA04 values
The concept of PVs used in the RCPL is to (a) select chemicals for which a solid base of in vivo data is available, ideally human data next to LLNA data, to (b) collect LLNA EC3 data and quantitative human potency data in the form of DSA04 and no observed effect level (NOEL) values, and to (c) derive a potency value using a defined WoE workflow integrating both the human and LLNA data, whereby these workflows are based on a general preference for positive human sensitization data if available. The current RCPL contains a total of 33 chemicals. While this number of chemicals is deemed sufficient to compare different pre-

2 doi:10.14573/altex.2302081s1

For step (c), i.e., deriving the PV, the workflows from Irizar et al. (2022) were applied. These workflows give preference to human data in case the LLNA value is not close to the human value. In case a PV was available from the RCPL list, this value was chosen by default, as these values had been carefully curated. To compare LLNA and DSA04 values, LLNA EC3 values in percent are multiplied by 250 to be expressed in µg/cm² (Griem et al., 2003). The PV is thus expressed in µg/cm² and may come from either the transformed LLNA EC3, the DSA04 or the human NOEL. The rationale and source data for the selected PV are given for each chemical in the database in the supporting information (ESM1 2).
In total, comprehensive in vitro data are available for 31 chemicals with existing PV values from Irizar et al. (2022) (no data are available for furaneol and 5-chloro-2-methyl-4-isothiazolinone, which had only been tested in all in vitro tests as the mixture Kathon CG). In addition, for 108 chemicals with comprehensive in vitro data a new PV was derived. Table 2 summarizes the database used for the current analysis and from which data source the PV were derived. For non-sensitizers the default PV is set to 25,000, i.e., an LLNA EC3 value of 100%. Based on the RCPL workflows, expert judgment is needed to derive the PV in some specific cases, but in most cases the workflows led to unambiguous decisions.
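The unit conversion used to place LLNA EC3 values and human dose-per-area values on one scale is a single multiplication. A minimal sketch follows; the function and constant names are illustrative, and only the factor of 250 and the default of 25,000 µg/cm² come from the text.

```python
# Conversion factor from LLNA EC3 (% w/v) to dose per area (µg/cm²), as stated in the text.
EC3_PERCENT_TO_UG_PER_CM2 = 250.0


def ec3_percent_to_dose_per_area(ec3_percent: float) -> float:
    """Convert an LLNA EC3 given in percent to µg/cm²."""
    if ec3_percent <= 0:
        raise ValueError("EC3 must be a positive percentage")
    return ec3_percent * EC3_PERCENT_TO_UG_PER_CM2


# Default PV for non-sensitizers: an EC3 of 100% corresponds to 25,000 µg/cm².
DEFAULT_NON_SENSITIZER_PV = ec3_percent_to_dose_per_area(100.0)
```

With this convention, an EC3 of 1% corresponds to 250 µg/cm².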
To apply the same regression analysis as done previously using pEC3 values, a pPV value was calculated similar to the pEC3 which is then in the same scale and can be directly compared: For non-sensitizers, the pPV is set to 0, i.e., the value for a chemical with MW of 100 and a PV of 25,000. Similarly, a pDSA04 value was calculated to directly use DSA04 values for the analysis.

it is assumed that the incidence is doubled by doubling the test concentration, thus the DSA04 is: with DSA1+ as reported in the DASS database and n the number of study subjects in a given study.
In another expression, i.e., if DSA1+ was not already given in the DASS database, the DSA04 was calculated as: .
If multiple studies were available, the weighted geometric mean was calculated from the logarithmic weighted geometric mean, giving studies with fewer panelists a lesser weight: whereby j is the number of studies, n_i the number of subjects in a given study and n the total number of subjects in all studies considered. This approach of weighing all the studies available in the DASS database led overall to similar DSA04 values as those derived by the RCPL authors (Fig. 1a) and those calculated based on the collection of DSA05 values in Bil et al. (2017) (Fig. 1b). Of course, these different data collections are largely based on the same primary studies, but the good correlation indicates that pooling the detailed studies collected by the DASS expert team using weighted geometric means overall confirms other assessments and is a valid way for deriving DSA04 values by integrating all collected evidence.

For ease of reading, the new regression equations in the text are not explicitly spelled out as above but rather the regression coefficients (shown in bold, underlined text) defining these equations are given in tabular form. The analysis is performed for different combinations of input parameters:
a) KS and kDPRA (EQ1 in the previous publication)
b) h-CLAT and kDPRA (EQ4 in the previous publication)
c) KS, h-CLAT and kDPRA (EQ5 in the previous publication)
In addition, the results for the different versions of EQ6 integrating KS and h-CLAT data are shown in the supporting information (ESM2 3).
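The DSA04 extrapolation and the pooling of multiple studies described in this section can be sketched as follows. The equations themselves are not reproduced in the text above, so the formulas here are an assumption derived from the stated linear dose-response (doubling the dose doubles the incidence) and from the stated definitions of DSA1+, n, j and n_i; all function names are illustrative.

```python
import math

TARGET_INCIDENCE = 0.04  # 4% of exposed subjects


def dsa04_from_dsa1plus(dsa1_plus: float, n_subjects: int) -> float:
    """DSA1+ is the dose for 1 positive reaction among n subjects (incidence 1/n).
    Scaling the incidence linearly to 4% gives DSA1+ * 0.04 * n (assumed formula)."""
    return dsa1_plus * TARGET_INCIDENCE * n_subjects


def dsa04_from_study(dose: float, n_positive: int, n_subjects: int) -> float:
    """Same linear extrapolation, starting from the tested dose and observed incidence."""
    if n_positive <= 0:
        raise ValueError("a DSA04 can only be derived from a study with positive cases")
    incidence = n_positive / n_subjects
    return dose * TARGET_INCIDENCE / incidence


def weighted_geomean_dsa04(studies: list[tuple[float, int]]) -> float:
    """Pool (DSA04, n_subjects) pairs; each study is weighted by n_i / n on the log scale."""
    n_total = sum(n_i for _, n_i in studies)
    log_mean = sum(n_i * math.log10(dsa04) for dsa04, n_i in studies) / n_total
    return 10 ** log_mean
```

Under these assumptions the two extrapolation routes agree, since DSA1+ equals the tested dose divided by the number of positive reactions. Weighting on the log scale means a small study shifts the pooled value less than a large one.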
An overview of the different datasets relevant for potency assessments put together by the OECD DASS, in our previous analysis, and for this work is given in Table 1, while Table 2 gives more details on the database used in the current analysis.
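The pPV transformation described in this section can be sketched as below. The explicit equation is not reproduced in the text, so this assumes that the pEC3 is defined as log10(MW / EC3 in %) and that the PV is first converted back to percent by dividing by 250; this choice is consistent with the stated anchor point (MW of 100 and PV of 25,000 give a pPV of 0) but remains an assumption. Function names are illustrative.

```python
import math

EC3_PERCENT_TO_UG_PER_CM2 = 250.0  # conversion factor stated in the text


def p_pv(mw: float, pv_ug_per_cm2: float, non_sensitizer: bool = False) -> float:
    """Molar-scaled, logarithmic potency value (assumed form: log10(250 * MW / PV))."""
    if non_sensitizer:
        return 0.0  # the text sets pPV to 0 for all non-sensitizers
    return math.log10(EC3_PERCENT_TO_UG_PER_CM2 * mw / pv_ug_per_cm2)


def pv_from_p_pv(mw: float, ppv: float) -> float:
    """Inverse transformation: recover a dose per area (µg/cm²) from a predicted pPV."""
    return EC3_PERCENT_TO_UG_PER_CM2 * mw / (10 ** ppv)
```

Under this assumption, a chemical with MW 200 and a PV of 250 µg/cm² (an EC3 equivalent of 1%) has a pPV of about 2.3, which matches the boundary for strong sensitizers quoted later in the text.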

Results
In the analysis below, predictive regression equations were calculated in the form of, e.g., EQ5 from the previous publication given here as an example:

dataset. For all combinations of in vitro input parameters, models with R² values similar to or higher than those of the published models were obtained. There are three important observations from this initial analysis: First, comparing the equations for the pEC3 on these 31 chemicals with the published equations indicates a clearly different outcome: In all three combinations of input parameters (EQ1, EQ4 and EQ5) a lesser weight for the reactivity parameter kmax and a higher weight for both cytotoxicity parameters and the KS EC1.5 was found (see, e.g., EQ1a vs EQ1). This indicates that based on this small data subset, a clearly different predictive model for the LLNA would be derived, indicating that this small set of chemicals is not representative of the whole chemical dataset with LLNA and in vitro data available, and it is therefore not recommended to use the RCPL to train predictive models. This is not at all surprising given the difference in size (n = 31 vs n = 188-203).
Next, comparing the equations calculated for pEC3 and pPV for the 31 chemicals, an interesting observation is made in all combinations of input parameters (EQ1, EQ4 and EQ5): For predicting the pPV, the coefficients (and hence the contribution to sensitizing potency) of kmax, KS EC1.5, and/or h-CLAT MIT increase strongly in parallel to a large drop in the coefficients of both cytotoxicity parameters as compared to the situation when predicting the pEC3 (see, e.g., EQ1a vs EQ1b). This observation may indicate an important mechanistic difference when integrating the human data in the PV and actually triggered our curiosity to evaluate predictive models on a larger dataset of PV values as presented below.
The calculations were performed for different target data: a) LLNA pEC3, b) potency values expressed as pPV, or c) human data expressed as pDSA04. The coefficients of the regression equation from the previous work on predicting the pEC3 value on the comprehensive database (n = 188-203) are given first in each of the tables for comparison. First the equation for the pEC3 target in each data subset is calculated and compared to the published equation calculated vs the pEC3. With this first comparison, an estimation can be made to which extent the specific smaller data subset affects the outcome when using the same target (namely pEC3) as in the previous analysis. This is important in order not to confuse changes observed when analyzing the different targets (pPV or pDSA04 vs pEC3) with changes due to the different set of chemicals.
Below, the same analysis is shown first for the published RCPL, then (and as the key analysis) for the extended PV list developed within this work, and finally on the list of chemicals with human sensitization data (DSA04 and human non-sensitizers).
The different equations are labeled with lowercase letters added to the numbers from the previous publications to differentiate equations on different combinations of input parameters, different targets, and different datasets.

Regression analysis vs the RCPL
The starting point of this investigation is based on the published list of 33 PV values, for 31 of which in vitro data were available. The same analysis on KS and h-CLAT data only (EQ6) is shown in ESM2 3 and led to completely congruent conclusions.

Regression analysis vs the list with DSA04 values
Having collected the DSA04 values along with the in vitro data raises an obvious question: Given that the models differ significantly whether we predict the EC3 or the PV - and given that the PV is still significantly influenced by the large number of EC3 values used to derive PV values next to the human data - would we arrive at more human-relevant models when using only the human DSA04 values and the information on the human non-sensitizers? Thus, the same analysis as above was repeated a third time, this time only including the subset of data with human DSA04 values or a negative human sensitization status (n = 62). This analysis is shown in Table 6, and similar comparisons as above can be made, with three key observations: First, comparing the models for predicting EC3 (e.g., EQ1f vs EQ1) with the published models indicates that we obtain significantly different LLNA models on this set of chemicals. Thus, on this dataset the reactivity parameter has almost no weight to predict the LLNA EC3 and the coefficient for the MIT is reduced

Third, the highest R² (74-75%) is found for the models integrating all data in EQ5. This is higher than in all published models and is true for both the pEC3 and the pPV. However, this may already partly be due to an over-fitted model. Thus, while the overall statistics are still highly significant, the coefficient is still statistically significant in EQ5b only for the KS EC1.5 and the MIT, and all the coefficients have a high standard deviation (data not shown).

Regression analysis vs the extended list of PV values
In Table 4, the same analysis as above is presented for the extended list (n = 139) with PV values (see Section 2 and ESM1 2). In all cases, highly significant regression models with R² values similar to those of the published models were derived. There are two important findings from this analysis: First, the regression models developed to predict the pEC3 based on these 139 chemicals are almost identical to the published models developed on n = 188-203 chemicals (see, e.g., EQ1c vs EQ1). This indicates that this set of chemicals can be considered representative of the full database, and we would not expect a major database bias for models developed for this smaller dataset.
Secondly, and maybe most importantly, the outcome is clearly different when evaluating against pPV values as compared to the evaluation vs the pEC3 target. More specifically, in all three input parameter combinations (EQ1, EQ4 and EQ5), the weight of the reactivity parameter, the KS EC1.5, and/or the h-CLAT MIT is increasing in parallel to a large drop in the weight of the cytotoxicity coefficients (see, e.g., EQ1d vs EQ1c). This confirms the observation reported above on the much smaller RCPL set. Actually, the cytotoxicity parameters become statistically non-significant when evaluating vs the PV, but they are highly significant in EQ1 and EQ4 when evaluating vs the LLNA (Tab. 5). Also, the volatility parameter becomes non-significant. Thus, in order to predict the pPV, a simplified equation (EQ5e) based only on

the gold standard to predict human potency) only yields an R² = 43% in this curated dataset, and thus prediction of human potency by animal data is not better than the prediction of human potency by in vitro data:
Constant: standard error = 0.17, p = 0.017
pEC3: standard error = 0.11, p < 0.0005
n = 62; S = 0.78; R² = 43%; R²(adj) = 42.0%
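The single-predictor regression of pDSA04 against pEC3 summarized by these statistics can be sketched with ordinary least squares. This is an illustrative implementation in plain Python, not the statistical software used for the published analysis, and the function names are hypothetical.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, R²)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R², penalizing the number of predictors relative to the sample size."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
```

Applying `adjusted_r2` to the rounded values reported above (R² = 0.43, n = 62, one predictor) gives about 0.42, in line with the reported adjusted R². The same penalty grows with the number of input parameters, which is why small datasets are prone to over-fitting in the multi-parameter models.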

Predictive differences of the PV-derived models as compared to the EC3-derived models
Among the three sets of chemicals, the analysis on the extended PV list is of highest relevance: This dataset appears representative of the larger dataset analyzed earlier. At the same time, this dataset leads to the same conclusion as the analysis on the other subsets, namely that integrating human data yields different predictive equations, with less weight for cytotoxicity parameters and higher weight for reactivity and cell activation in both KS and h-CLAT. Thus, the key question is how the prediction will be changed for individual chemicals by, e.g., applying EQ5e based on kDPRA, KS and h-CLAT and trained on PV values instead of the published EQ5. To illustrate the difference, the predicted potency using these two models for all 139 chemicals with PV values is plotted in Figure 2. Even though the new regression equation EQ5e, trained on the PV values and hence excluding the cytotoxicity parameters, is significantly different from the published EQ5, the predictions with these two equations correlate well (R² = 0.94; slope = 1.02 and y-intercept = 0.04, indicating the regression line is very close to the line of identity). This indicates that this additional analysis integrating the human evidence in the WoE PV values leads to a surprisingly small change in the prediction of skin sensitizer potency for individual chemicals. Still, on a more granular scale, interesting differences can be noted: (i) for the chemicals for

too, while a higher weight for both cytotoxicity parameters and KS EC1.5 is found. This indicates that this chemical set, similar to the RCPL, is not representative of the full database. The observation that cytotoxicity becomes a dominant predictor for pEC3 and reactivity appears almost irrelevant is contrary to our knowledge on the AOP and further underlines the notion that this appears to be a biased data subset.
Next, when comparing the models for the pEC3 target vs the models for the pDSA04 target on these n = 62 chemicals, a difference in the weight of the parameters is noted as in the above analysis comparing the targets pEC3 and pPV. Thus, the weight of the reactivity parameter - although remaining low - is increased, the coefficients for the EC1.5 and the MIT are strongly increased in all three cases (EQ1, EQ4 and EQ5), and again, the weight of the cytotoxicity parameters is strongly reduced and becomes insignificant or even negative in all cases (see, e.g., EQ1g vs EQ1f).
Third, we note overall low R² values (32-45%) when fitting models to pDSA04 values, while the R² is only marginally reduced with this dataset when evaluating against pEC3 (53-60%) as compared to the published models on the full dataset (R² = 61-62%). This might offer a rather disturbing truth, namely that the validated in vitro methods are rather poor predictors of the human sensitization potency as compared to their ability to predict the LLNA. However, this conclusion may be a shortcut: Thus for the same dataset also a correlation analysis of the pDSA04 against the LLNA pEC3 was made, indicating that actually the prediction of the human data by the LLNA (which is considered

the two assessments are summarized in Table 7. This list includes only six chemicals. The omission of the cytotoxicity parameters in EQ5e clearly renders the LLNA false-positive surfactant SDS a predicted non-sensitizer. On the other hand, two weak and four moderate sensitizers in the LLNA are predicted as more potent by the PV model. This is better aligned with the in vivo data in four cases but leads to over-prediction in the case of formaldehyde and ethyl acrylate.
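The selection of chemicals for such a discrepancy table can be expressed as a filter on the difference between the two log-scale predictions. This is a minimal sketch assuming a threshold of 0.5 log units (roughly a three-fold difference in the predicted dose); the function name, the input layout, and any chemical names passed to it are hypothetical.

```python
def flag_discrepancies(
    predictions: dict[str, tuple[float, float]],
    threshold_log_units: float = 0.5,
) -> dict[str, float]:
    """Return chemicals whose two predictions differ by more than the threshold.

    `predictions` maps a chemical name to (LLNA-trained prediction, PV-trained prediction),
    both on the logarithmic pEC3/pPV scale. A positive difference means the PV-trained
    model predicts a higher potency.
    """
    flagged = {}
    for name, (p_llna_model, p_pv_model) in predictions.items():
        difference = p_pv_model - p_llna_model
        if abs(difference) > threshold_log_units:
            flagged[name] = difference
    return flagged
```

Keeping the sign of the difference makes it easy to separate cases like SDS, where the PV-trained model predicts a lower potency, from cases where it is more conservative.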

Application in practice
For a straightforward prediction of the sensitization potency, our previous publication provided a calculation spreadsheet to directly derive a PoD from the quantitative in vitro data. Here an updated prediction spreadsheet is provided in ESM3 4, including the additional regression equations derived from the extended PV list. Users will thus have multiple PoD calculated based on LLNA or PV trained models and based on the different input pa-

which a strong sensitization potency is predicted (pPV or pEC3 is > 2.3 corresponding to an EC3 of around < 1%), the predictions are almost equivalent, and all data points are close to the line of identity; (ii) for moderate sensitizers (pPV / pEC3 ≈ 1.3-2.3 corresponding to an EC3 of ≈ 1-10%), there is a tendency that more points are above the regression line, indicating that the new equation EQ5e is slightly more conservative for moderate sensitizers; (iii) the highest scattering on both sides of the regression line is observed for the weak sensitizers and non-sensitizers (pPV / pEC3 < 1.3), with a tendency that more points are below the regression line, indicating that EQ5, which includes the cytotoxicity parameters, is more conservative and predicts a higher sensitization potency for these weak and non-sensitizing chemicals.
The actual predictions by both EQ5 and EQ5e for all individual chemicals along with the PV and EC3 values are given in the supporting information, and here only the cases with the highest discrepancy (> 3.3-fold difference, i.e., > 0.5 Log-unit) between

Discussion
The goal of NAM-based in vitro testing for skin sensitization is to provide the necessary tools for risk assessment, i.e., to be able to derive safe use levels for chemicals in the absence of animal data. Such a risk assessment needs to be human-relevant, and one may therefore start with the naïve assumption that the assessment ultimately should be trained on human data only. However, as indicated in Section 1, there are pros and cons of using human data in terms of data availability, data quality, and availability of dose-response information. Thus, the present analysis is focused on an empirical and statistical assessment of the use of human data to determine the best way to integrate them into the assessment, and it builds on the work on the RCPL list (Irizar et al., 2022).
The data analysis follows our previous work on skin sensitization potency assessment exclusively based on OECD-accepted in vitro assays and focused on the LLNA as target (Natsch and Gerberick, 2022a,b), which itself was a continuation of a previous analysis using a different peptide reactivity assay (Natsch et al., 2018).
There are four key learnings from this in-depth analysis on different in vivo targets and different chemical sets which will be discussed separately: (i) Adding human data in the form of PV to an extended dataset appears a better option than training models exclusively on the available human data due to bias and limitations in the human dataset.(ii) Integrating human data into the assessment leads to models with a lesser weighting of cytotoxicity, which -based on a rameters.How to best select the most relevant result (EQ1, EQ4 EQ5 or EQ6) depending on (partial) data availability has been discussed in detail in our previous publications (Natsch and Gerberick, 2022a,b).The additional, PV-derived PoD values may be used in a similar way.In a conservative assessment, users may select the lower PoD from either the LLNA-or the PV-based models, unless there is a mechanistic reason to prefer one over the other.Thus, e.g., in the case of a strongly cytotoxic molecule with only a weak sensitization alert and weak experimental reactivity, preference may be given to the models trained on the PV as this may avoid a bias introduced by cytotoxicity, since cytotoxicity may rather reflect the irritant potential of that molecule.Vice versa, in case of strong activation of cellular markers significantly below the cytotoxic level, and especially in the presence of structural alerts for reactivity and/or experimental reactivity, the more conservative PoD derived from the PV models may be preferred.However, as shown above, the difference in most cases will be rather small.The prediction spreadsheet also provides two case studies on 2,4-dinitro-chlorobenzene (DNCB) and cinnamic aldehyde, and the results for these two chemicals are shown in Table 8.For these data-rich molecules with congruent in vitro results (kDPRA, KS and h-CLAT are all positive), the PoD values derived by all published models but also by the new PV-derived models are very close, indicating that the assessment based on 
different subsets of input parameters and based on different targets (LLNA or PV) is highly robust.All the models under-predict the PV of DNCB, which is based on a strong sensitization potency observed in humans, while they are more aligned with the LLNA potency of cinnamic aldehyde, which has a relatively high PV based on human predictive tests.and it is thus highly consistent.The resulting modification of the regression model in EQ5e with the omission of the cytotoxicity parameters has the strongest impact on the prediction of the weak/non-sensitizers, and it points to an important mechanistic difference between LLNA and human potency data, with cytotoxicity / irritation being a confounder for the weak / non-sensitizers in the LLNA.Interestingly, in a Bayesian statistics approach for predicting LLNA potency classes, the mutual information of the cytotoxicity parameters was strongest for the prediction of non-sensitizers followed by weak sensitizers, whereas they had the least mutual information with the strong sensitizers (Jaworska et al., 2015), which is in line with our observations.Skin irritation as a confounding factor in the LLNA had been discussed in detail when assessing the LLNA for hazard assessment (Ball et al., 2011;Garcia et al., 2010;Kreiling et al., 2008;Natsch et al., 2023), and the relationship between irritation and cytotoxicity is well established (Muller-Decker et al., 1994).The statistical analysis provided here independently again indicates an important effect of cytotoxicity to predict a weak sensitization potency in the LLNA, which may not perfectly mimic the human situation in all cases.Cytotoxicity and resulting irritation may be a key factor causing false-positives in the LLNA, especially for chemicals only positive at high concentrations, and this was recently discussed in detail (Natsch et al., 2023).On the other hand, a positive LLNA result cannot be questioned based only on cytotoxicity as also true skin sensitizers are often 
irritants.Identifying LLNA false positives thus requires a WoE assessment including cytotoxicity, lack of structural alert, and lack of human sensitization data on the target chemical or related analogues.
(iii) Effect of integrating human data on predicted potency Maybe the most important question is the practical one -Does the integration of human data significantly change the risk assessment we would make based on LLNA-trained models such as the ones we provided before (Natsch and Gerberick, 2022a,b)?This question may also relate to any existing or future model solely based on LLNA potency data.The comparison of the predictions obtained from the different equations EQ5 and EQ5e indicates: The effect is surprisingly small, and overall, predictions are quite similar.This indicates that the models derived from LLNA data only are quite robust, but we could now provide additional models in the calculation spreadsheet (ESM3 4 ) which can be used in a WoE approach, taking into account the learning that for some cytotoxic compounds (which may be over-predicted by the LLNA models) or for compounds strongly activating the cellular markers in the absence of cytotoxicity (which may be under-predicted by the LLNA model) it may make sense to use this additional information in a WoE and select the PoD accordingly.
(iv) The extended PV list as a tool for modeling Finally, this work provides an extended list of chemicals with PV assessed according to the general RCPL principles.This list is not meant to replace the established lists for an assessment of predic-purely statistical analysis -points to important mechanistic differences in human and LLNA data.(iii) While the human data affect the weight of the input parameters, there is a surprisingly small effect on the individual predictions, indicating a high robustness of the regression models for application in practice.(iv) The extended RCPL list may be used to train models integrating human data in a more comprehensive database.

(i) Human data as target
The analysis of the dataset (n = 62) with human data only showed that this set of chemicals is biased.Thus, when ignoring the human data and just predicting the pEC3 with models trained on this subset, the models deviate strongly from the models on the extended PV list (n = 139) and from the published models (Natsch and Gerberick, 2022a) (n = 188-203) and also on the larger dataset included in supporting information (ESM45 ) to the previous analysis (Natsch and Gerberick, 2022a) (n = 317, EQ22).Statistical analysis of all those larger datasets points to a very consistent and strong contribution of the Cys-peptide reaction rate to the potency next to the contributions from KS and/or h-CLAT, in perfect alignment with the sensitization AOP (OECD, 2014).All these larger datasets led to almost identical regression coefficients in an analysis vs the pEC3.However, the contribution of reactivity is lost when predicting the pEC3 for the human dataset only (see EQ1f and EQ5g in Tab. 6).This shows that -completely independent of the human data and the human data quality -the collection of chemicals for which we can make this human assessment is a biased set and not a representative subset of the database.This finding may not only be relevant to the modeling approach by the regression models, but also to all other attempts to derive human-relevant potency models relying on this human dataset.
Next to this dataset bias (which is independent of the quality of the human data), we also may have a certain reservation about the quality of the human data.The rather poor prediction (as expressed by low R 2 ) of the human sensitization potency by both the LLNA and the in vitro data may indicate that the quantitative human data may not be ideal in terms of data quality, which may be related to the limited dose-response information and the historical nature of most of the test results coming from not fully standardized protocols.This further indicates that these data should be used in a WoE and not as sole predictive target.Thus, using these data in the form of a WoE PV as proposed by the RCPL workflows (Irizar et al., 2022) appears a balanced way forward for integration of the human evidence.
(ii) Importance of cytotoxicity parameters From a mechanistic angle, the most interesting finding is that moving from an LLNA-only based assessment to the PV or the DSA04 as a target leads to an almost complete loss of the influence of the cytotoxicity parameters.The same effect is observed on all three sets of chemicals (Tab. 3,4 and 6) and on all four combinations of input parameters (including EQ6, see ESM2 3 ), tive capacity for hazard assessment such as the DASS database but rather for modeling potency.It may be further refined, but it could be a useful tool going forward since training potency models exclusively against human data appears no ideal solution (see point (i) above).On the other hand, it is not recommended to train models on the narrow set of chemicals of the RCPL: Thus, EQ5b trained on the RCPL has the highest predictivity for the RCPL (R 2 = 74%; Tab.3), while published EQ5 has an R 2 = 67% for this set of chemicals (data not shown).However, EQ5b trained on the PV list has an R 2 of 61% on the entire dataset (data not shown) and performs worse than EQ5d trained on the full dataset (R 2 = 64%; Tab.4), indicating that training on the RCPL will not lead to the best model to represent the complete dataset.This is because the chemical set of the RCPL is not representative of the full dataset.
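The conservative selection rule described for the prediction spreadsheet (take the lower PoD from the LLNA- or PV-based models unless there is a mechanistic reason to prefer one) can be illustrated with a minimal sketch. It assumes both PoD values are expressed in the same units; the function names and the `prefer` switch are hypothetical and are not part of the published spreadsheet. The 0.5 log-unit threshold is the discrepancy cut-off mentioned in the text.

```python
import math


def log_difference(pod_a, pod_b):
    """Absolute difference between two PoD values in log10 units."""
    if pod_a <= 0 or pod_b <= 0:
        raise ValueError("PoD values must be positive")
    return abs(math.log10(pod_a) - math.log10(pod_b))


def select_pod(pod_llna_model, pod_pv_model, prefer=None, log_threshold=0.5):
    """Pick a PoD from an LLNA-trained and a PV-trained model.

    prefer: None (take the lower, i.e. more conservative, PoD), or
            "LLNA" / "PV" when a mechanistic reason favours one model.
            This switch is a hypothetical illustration only.
    """
    if prefer is None:
        chosen = min(pod_llna_model, pod_pv_model)
    elif prefer == "LLNA":
        chosen = pod_llna_model
    elif prefer == "PV":
        chosen = pod_pv_model
    else:
        raise ValueError("prefer must be None, 'LLNA' or 'PV'")

    diff = log_difference(pod_llna_model, pod_pv_model)
    return {
        "pod": chosen,
        "log_difference": diff,
        # Flag pairs that differ by more than the threshold (0.5 log units)
        "discrepant": diff > log_threshold,
    }
```

A flagged pair is a prompt for the WoE considerations discussed above (cytotoxicity, structural alerts, experimental reactivity); the sketch does not replace that judgment.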

Outlook
Based on the shortcomings of the human dataset, a key question remains: Should further models for potency assessment always include human data? Based on the current analysis, there are important mechanistic learnings from the human data in terms of weighing irritation and cytotoxicity, yet the human data do not lead to a potency model with a substantially different final outcome, which is reassuring for risk assessors. Future attempts may thus apply the extended list of PV to test the robustness of models trained entirely on LLNA data in order to not miss mechanistic aspects, yet the limitations of the human data highlighted here also indicate that the human database should not be overemphasized when predicting sensitizer potency.

Fig. 1: DSA04 values derived from the DASS database and using the weighted geometric mean in cases where multiple studies were available as compared to the DSA04 values used in (a) the RCPL list or (b) DSA04 values calculated from the ICCVAM DSA05 values reported in Bil et al. (2017)

Fig. 2: Predicted skin sensitizer potency based on kDPRA, KS, and h-CLAT with the published EQ5 trained on LLNA data vs the predictions with EQ5e trained on the PV values and hence excluding the cytotoxicity parameters

Abbreviations
AOP, adverse outcome pathway; DA, defined approach; DASS, defined approaches for skin sensitization; DNCB, 2,4-dinitro-chlorobenzene; DPRA, direct peptide reactivity assay; DSA, dose per surface area; ECHA, European Chemicals Agency; EC3, estimated concentration for threefold cell proliferation; GHS, Globally Harmonized System; GPMT, guinea pig maximization test; h-CLAT, human cell line activation test; HMT, human maximization test; HRIPT, human repeat insult patch test; IATA, integrated approaches to testing and assessment; ICCVAM, Interagency Coordinating Committee on the Validation of Alternative Methods; IFRA, International Fragrance Association; ITS, integrated testing strategy; kDPRA, kinetic direct peptide reactivity assay; KS, KeratinoSens™; LLNA, local lymph node assay; NAM, new approach methodology; NESIL, no expected sensitization induction level; NOEL, no observed effect level; PoD, point of departure; PV, potency values; OECD, Organisation for Economic Co-operation and Development; QRA, quantitative risk assessment; RCPL, Reference Chemical Potency List; RIFM, Research Institute for Fragrance Materials; SDS, sodium dodecyl sulphate; TG, test guideline; WoE, weight of evidence

[...] and 38 NS (this work; see Tab. 2). This list was used for the key analysis in Tab. 4; 108 additional chemicals with PV and in vitro data as compared to RCPL with in vitro data (n = 31)
a 16 additional chemicals are positive in the LLNA but lack EC3 values in the DASS DB; b Chemicals with kDPRA, KS, and h-CLAT data were considered for inclusion in the WoE list; c S, sensitizers; NS, non-sensitizers; d Three of the human sensitizers have DSA04 values > 25,000, hence their potency is set equal to the non-sensitizers; here they are counted with the sensitizers.
3 doi:10.14573/altex.2302081s2
Natsch ALTEX 40(4), 2023

Tab. 2: Summary of the extended list of chemicals with PV values developed for this work
Columns: Contained in; WoE sensitizer status; PV value is derived from; n
a If EC3 is less than 2-fold different from DSA04, EC3 is taken as PV value in RCPL workflow. b For three chemicals the DSA04 may underestimate the human sensitization potential, and therefore according to workflow #2 the lower EC3 is used as PV (phenyl benzoate, tetramethylthiuram disulphide, 2-mercaptobenzothiazol). c For 3 chemicals the selected PV is the human NOEL (α-amylcinnamic aldehyde, resorcinol, α-methylcinnamaldehyde; workflow #3)
Table 3 indicates the results of the regression analysis on this [...]
[...] KS EC1.5, h-CLAT MIT, and kmax and completely excluding cytotoxicity can be derived for the extended PV list which has an identical R² = 63%:
EQ5e: pEC3 = 0.21 + 0.38 × Log kmax norm + 0.31 × Log MIT norm + 0.21 × Log EC1.
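As an illustration of how the linear form of EQ5e is evaluated, the following sketch computes the predicted potency from three already normalized, log-transformed inputs and assigns the potency band used in the comparison of EQ5 and EQ5e (> 2.3 strong, 1.3-2.3 moderate, < 1.3 weak or non-sensitizer). Two assumptions are made: the last term of the equation is truncated in the text and is taken here to be the normalized KS EC1.5, and the normalization of the inputs (defined in the earlier publications) is not reproduced. The function names are hypothetical.

```python
def predict_eq5e(log_kmax_norm, log_mit_norm, log_ec15_norm):
    """Predicted potency (pEC3 scale) from EQ5e.

    Inputs are the normalized log values of kDPRA kmax, h-CLAT MIT and
    KS EC1.5. The third input name is an assumption, since the last
    term of the equation is truncated in the source text.
    """
    return (0.21
            + 0.38 * log_kmax_norm
            + 0.31 * log_mit_norm
            + 0.21 * log_ec15_norm)


def potency_band(p_value):
    """Map a predicted pEC3 / pPV to the bands used in the text."""
    if p_value > 2.3:
        return "strong"
    if p_value >= 1.3:
        return "moderate"
    return "weak or non-sensitizer"
```

With all three inputs at zero the prediction reduces to the intercept of 0.21, which falls into the weak or non-sensitizer band.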

Tab. 3: Regression coefficients and statistics for predictive models trained on the RCPL dataset
a The published equations are shown in bold. b The same analysis vs pEC3 as for the published EQ1 but only for the 31 RCPL chemicals. c Regression coefficients for the different parameters are given that fully define the regression equations as shown in the text for EQ5. New regression models were calculated for the specific dataset and different in vivo targets (PV or EC3 values). d Number of chemicals used to train the model. e Numbers and lowercase letters to differentiate the new equations. EQ1, EQ4 and EQ5 are the published equations.

Tab. 4: Regression coefficients and statistics for predictive models on the extended PV list
a See footnotes of Table 3.
Tab. 5: p-