The Impact of Biostatistics on Hazard Characterization Using In Vitro Developmental Neurotoxicity Assays

In chemical safety assessment, benchmark concentrations (BMC) and their associated uncertainty are needed for the toxicological evaluation of in vitro data sets. A BMC estimation is derived from concentration-response modelling and results from various statistical decisions, which depend on factors such as experimental design and assay endpoint features. In current data practice, the experimenter is often responsible for the data analysis and therefore relies on statistical software, often without being aware of the software default settings and how they can impact the outputs of data analysis. To provide more insight into how statistical decision-making can influence the outcomes of data analysis and interpretation, we have developed an automated platform that includes statistical methods for BMC estimation, a novel endpoint-specific hazard classification system, and routines that flag data sets that are outside the applicability domain for an automatic data evaluation. We used case studies on a large dataset produced by a developmental neurotoxicity (DNT) in vitro battery (DNT IVB). Here we focused on the BMC and its confidence interval (CI) estimation as well as on final hazard classification. We identified five crucial statistical decisions the experimenter must make during data analysis: choice of replicate averaging, response data normalization, regression modelling, BMC and CI estimation, and choice of benchmark response levels. The insights gained are intended to raise more awareness among experimenters on the importance of statistical decisions and methods but also to demonstrate how important fit-for-purpose, internationally harmonized and accepted data evaluation and analysis procedures are for objective hazard classification.

ment, which aims at using new approach methods (NAMs) for exposure-based, hypothesis-driven risk assessment without the generation of new animal data (Li et al., 2021;Dent et al., 2021;Pallocca et al., 2022).
Typically, an in vitro HTS test system produces hazard data for a relatively large number of test concentrations and thus makes it most suitable for concentration-response regression modelling.This statistical approach allows the interpolative estimation of a concentration value at a given effect level (effect or inhibitory concentration), and specifically of the benchmark concentration (BMC) and its associated uncertainty, expressed as lower limit of a one-sided 95% confidence interval (BLL).In analogy to the

Introduction
In 2007, the National Research Council (NRC) of the United States proposed a new strategy for toxicity testing in the 21 st century centering around a shift from in vivo experiments in animals to mechanism-based in vitro testing (NRC, 2007).Since then, major advances in the field of in vitro toxicology have been made, including the development and establishment of medium and high throughput screening (HTS) assays as well as bioinformatics tools for data generation, management, and analysis (Leist et al., 2014;Wheeler et al., 2015;Villeneuve et al., 2019).These efforts are contributing to next generation risk assess-
The BMC estimation consists of various statistical decisions to be made in the concentration-response analysis, which depend largely on the experimental design, the concentration-response data, and assay endpoint features, and which require statistical expertise that is usually only held by experienced biostatisticians.In current data practice, the experimenter is often responsible for the data analysis and therefore relies on statistical software without being aware of the software default settings and how they can impact outputs of data analysis (Jensen et al., 2020).Existing guidelines for concentration-response data analysis are often too general (OECD, 2006;EFSA, 2017), and no clear consensus on a common and standardized biostatistical method for in vitro toxicity data has been achieved (Wheeler et al., 2015;Sand et al., 2017).
To provide more insight into how statistical decision-making can influence the outcomes of data analysis and interpre-benchmark dose (BMD) approach for in vivo studies, a BMC is considered the lowest concentration of a test compound that produces a pre-defined small "relevant" change to the control reference's response level (Crump, 1995;Krebs et al., 2020), and therefore, the benchmark response (BMR) value should be as "close as possible" to the control response.
In vitro test systems represent a huge variety of different types of assays, from cell-free, cell and tissue-based methods up to multi-response organoid systems, and therefore, concentration-response data vary enormously among these systems with respect to their test-specific experimental designs, data variability, dynamic ranges, and concentration-response patterns.Unique to HTS systems is also that assay outputs are produced in microplate multiwell readers, with concentration-response data from the same concentration and experiment considered to reflect technical (intra-replicate) variation and data from repeated experiments more indicative for "biological" (between-study) variation.These hierarchical data are usually simplified by using an average response value per test concentration and experiment (replicate average) as Fig. 1: Study overview Several biostatistical data analysis and evaluation steps were analyzed for their impact on a BMC estimation and subsequent hazard characterization from developmental neurotoxicity (DNT) data: i) how to average replicate responses from an experiment, ii) how to normalize concentrationresponse data, iii) how to describe concentrationresponse data by regression modelling, iv) how to estimate a benchmark concentration (BMC) and its uncertainty, and v) which benchmark response (BMR) level to select.Changes caused by the use of different statistical methods were recorded for 148 compounds tested on up to 22 assay endpoints, and their impact translated into the compound's DNT hit classification and the predictivity performance of the overall assay battery.
BMC estimation, or a higher BMR, which guarantees a statistically more robust BMC estimation but might fail for compounds that produce weak responses below the intended BMR?We designed a standard data evaluation protocol ("standard protocol") and compared its outputs with the BMCs and confidence intervals (CIs) estimated using alternative statistical methods for the same DNT IVB data.The alternative statistical methods were chosen along the questions outlined in (i) to (v).This was supplemented by measuring their impact on hazard alerts derived from hit classifications, which separate cytotoxic concentration ranges from the respective BMC of the specific DNT endpoint, and by measuring their impact on the DNT IVB's ability to predict DNT adversity in terms of specify, sensitivity, and accuracy.

DNT data
All concentration-response data used in this study are from a DNT IVB of 8 assays with 22 endpoints, in which a total of 148 compounds were tested.120 compounds were tested across all assays, while 28 compounds were tested in at least 2 assays.Fourteen assay endpoints represent major key neurodevelopmental processes, and 8 endpoints measure general cell viability and cytotoxicity (Tab.1).This DNT IVB was developed in collaboration with EFSA with the aim of advancing the application of in vitro DNT testing for regulatory purposes.The term "BMC" was used equally for data from DNT-specific, cytotoxicity and viability endpoints.
Depending on the assay, fluorescent readouts using a multiplate reader or fluorescence and brightfield imaging with subsequent artificial intelligence-based image analysis (Schmuck et al., 2017;Förster et al., 2022) was performed as endpoint assessment.NPC assays were conducted with three to five independent experiments and 5 replicates each, UKN assays with three independent experiments and 6 replicates each (Tab.S11 ).Each compound was tested in at least eight concentrations per experiment.An overview of the assays, the cell models, and the respective endpoints is given in Table S1 1 , and more detailed information about the assay-specific experimental testing procedures and test outcomes can be found elsewhere (Masjosthusmann et al., 2020;Crofton and Mundy, 2021;Blum et al., 2022).
The BMR for each endpoint was derived from the between-experimental variability as the coefficient of variation of median plate medians (after normalization) measured at the lowest test concentration and across all independent experiments (Masjosthusmann et al., 2020).To achieve a better comparability across the endpoints, the BMRs were then rounded to the next higher value, resulting in three BMRs: a 10% change was selected for endpoints from the NPC2a and NPC1-5 cytotoxicity assay (BMR10), and a 30% change for endpoints from the NPC1, NPC2a, NPC2b, NPC3-5, and NPC1-5 viability assay (BMR30).tation, we have used case studies on a large dataset produced by a developmental neurotoxicity (DNT) in vitro battery (DNT IVB; Masjosthusmann et al., 2020;Crofton and Mundy, 2021;Blum et al., 2022).In this DNT IVB, 148 compounds were tested across up to ten test methods representing the neurodevelopmental key events (KE) of neural progenitor cell (NPC) proliferation, migration of neural crest and radial glia cells, neurons and oligodendrocytes, neuronal differentiation, neurite outgrowth of peripheral and central nervous system neurons, as well as oligodendrocyte differentiation, and accomplished by various endpoints measuring cell viability and cytotoxicity (Masjosthusmann et al., 2020;Blum et al., 2022).Some of the DNT-specific endpoints are derived from primary and organotypic cultures which mimic a system of high physiological complexity and cell type heterogeneity, and thus are more prone to a data variability typically observed in animal studies.Here we focused on the BMC and its confidence interval (CI) estimation, as well as the final hazard classification.For this purpose, we identified five crucial statistical decisions the experimenter must make during the data analysis (Fig. 1): (i) Replicate distribution and choice of location parameter for regression analysis: Shall the median of all replicate responses of an experiment be used, which makes no assumption on the data and thus reduces the negative impact of potential data outliers, or the replicate mean, which is more efficient when the replicate responses follow a symmetric distribution but, if violated, can lead to a biased estimation of the replicate mean?(ii) Response data normalization: Shall the responses of an experiment always be normalized to the control's response even if the exposure responses provide clear evidence against the use of control data, or shall in that case the "control reference" be estimated directly from responses observed at low exposure concentrations ("re-normalization", Krebs et al., 2018)?(iii) Regression model: Shall the concentration-response data always be described by the same and supposedly flexible mathematical model, or is it better to use several models and either subsequently select the best model by means of goodness-of-fit criteria ("best-fit method", Scholze et al., 2001) or estimate an average of all model fits ("model averaging", Claeskens et al., 2008)?(iv) Uncertainty of a BMC estimation: Shall the confidence level of a BMC be calculated by a simple and commonly used statistical approximation technique ("Delta method", Cox, 1990), which can lead to a less accurate confidence interval (Moerbeek et al., 2004), or by alternative approaches such as bootstrapping or inverse regression (Jensen et al., 2019)?(v) Benchmark response (BMR): Shall a response level most close to the control reference be selected, which might not always be applicable for the statistical concentration-response analysis and thus might fail to provide a reliable differentiation in the NPC3 assay is calculated as the number of neurons divided by the total number of cells with a nucleus:

Replicate averaging
The average assay response for controls and treatments from the same experiment was estimated either by the arithmetic mean or by the median.The variability between replicates was calculated as standard deviation (SD; for the mean) or as median absolute deviation (MAD; for the median).Outlier detection procedures were not applied, and data points from wells where technical problems were known or obvious (e.g., scanned images were blurred or empty, staining did not work properly on all cells) were excluded from the data analysis.

Effect data normalization
CRStats offers different normalization methods, which allows the translation of pre-processed effect data into relative values.
For this study, we used the following two methods (Krebs et al., 2018)

Significance analysis
The presence of at least one exposure concentration that has produced an effect response which differs statistically significantly from the responses of all remaining exposures is a crucial factor in the hazard classification method (Section 2.2.8).Such significant differences between treatment means were identified using the Tukey Honest Significant Differences test (alpha = 5%, two-sided) (Tukey HSD; Yandell, 1997) with hypothesis testing conducted on normalized replicate averages from at least three independent experiments.Based on sample size (≥ 3 indepen-For all UKN assays a 25% change was chosen (BMR25), and for the viability of the UKN2, a BMR10 was selected.

Data evaluation platform
For data processing and evaluation, the R package drc2 (Ritz et al., 2019) was extended and optimized for the use of data from multi-well plate experiments.The biostatistical data evaluation software CRStats 3 and an interactive R markdown document are both freely available.We defined a standard protocol for the evaluation of DNT IVB data using the following statistical methods: (i) average replicate per experiment estimated by median (Section 2.2.3), (ii) control-normalization followed by re-normalization (Section 2.2.4), (iii) application of several mathematical models to find the "best-fit" regression model for a BMC estimation (Section 2.2.6), (iv) CI estimation of the BMC by inverse regression (Section 2.2.7), and (v) selecting endpoint-specific BMRs for hazard classification as outlined in Table S1 1 .

Minimal data requirements
Data were accepted for data analysis only if the following three minimal data requirements were fulfilled: (i) at least two replicates per concentration were available, otherwise all readouts from this concentration were excluded, (ii) at least five concentrations per experiment provided readouts, otherwise the whole experiment was excluded, and (iii) at least two control replicates were available, otherwise the whole experiment was excluded.

Pre-processing
CRStats uses different assay-specific pre-processing steps to obtain a single response value for each well.For example, neuronal sion curve, i.e., the intersection of the horizontal BMR line with the lower and upper 95% CIs of the regression fit determined the BLL and BUL.This approach puts much emphasizes on a successful regression fit in terms of robustness and reliability, rests on a Wald-based confidence interval, which is only asymptotically valid, and assumes that the model parameters are close to being unbiased and normally distributed.The delta method is an asymptotic approach that combines information of the estimated model parameters to derive a Wald-type interval (Jensen et al., 2020).Bootstrapping uses computer-intensive simulation techniques that resample the original dataset to create a huge number of so-called bootstrap samples, with each sample mirroring the original data set with an identical experimental design but newly simulated effect responses.The same statistical data analysis was performed on each bootstrap sample, resulting in a distribution of resampled BMC values around the original BMC estimation.If the median of this distribution equals the original BMC (unbiased resampling), then the 2.5% and 97.5% quantiles are expected to mirror the BUL and BLL of the original BMC, respectively.For each bootstrap sample, the same regression model was used as part of the best-fit method or one model-averaged BMC if model averaging was performed.To simplify the model averaging method, only three regression models were considered (four-parameter loglogistic, four-parameter Weibull, and three-parameter exponential model).Bootstrapping was always conducted on 1000 resampled datasets, and due to the small sample sizes, we used the parametric version (Efron and Tibshirani, 1994).All resampling was performed using the function bmdMA of the R package bmd (Jensen et al., 2020).Bootstrapping can simulate a bootstrap sample that does not allow a BMC estimation or that leads to an unreliable BMC estimation well outside the tested concentration range.Therefore, a resampled BMC was excluded from the resampling distribution if it was 1.5-times above the highest test concentration or below the lowest test concentration.
To allow a better comparison between BMCs from different data scenarios, the BMC was transformed to a relative BMC on a log10 scale by relating the 100-fold BMC estimation to the highest test concentration of the data set: A relative BMC of 1 corresponds to a BMC that is tenfold below the highest test concentration of the data set, and a relative BMC above 2 corresponds to a BMC that has been extrapolated beyond the highest test concentration.The lower the relative BMC value, the more likely the estimation is supported by effect data from more concentrations.

Hazard classification
CRStats uses a hazard classification approach that judges whether data evidence is sufficient to define a compound as active for the specific DNT endpoint and whether this can be distinguished from an activity observed in cell health related endpoints (viabil-dent experiments) and family-wise error rate (≥ 5 concentrations) considerations, we expected the statistical limit of detection to match the effect size of the endpoint-dependent BMRs.As an average control value was always set to 100% (Section 2.2.4), controls were excluded from the significance analysis.Data provided no evidence against the Gaussian assumption.

Concentration-response regression analysis
The R packages drc (Ritz et al., 2015) and bmd (Jensen et al., 2020) were used for regression analysis and the estimation of a BMC and its associated uncertainty.The drm function fits a pre-defined regression model to the concentration-response data, with several options implemented to provide more flexibility for the estimation method.A large number of mathematical nonlinear regression functions was applied to the same data set (Tab.S3 1 ), and the best fitting model was then selected based on Akaike's information criterion (AIC) ("best-fit method", Scholze et al., 2001;Portet, 2020).AIC is commonly used to compare the relative goodness-of-fit among different models and to then choose the model of best predictive power by balancing data support against model complexity.As all effect endpoints in this study are continuous, the ordinary least-squares (OLS) method was used.OLS relies on two assumptions: (i) effect data (here replicate average) follow a symmetrical distribution and (ii) variance is homogenous across all treatment groups.Both assumptions were checked prior to data analysis on the basis of pooled endpoint-specific data from all experiments (Breusch-Pagan test for heteroscedasticity, Breusch et al., 1979; Triples test for symmetry, Randles et al., 1980).Data variability differed on average by up to 20% between the treatment groups, with the highest variability often occurring at the highest test concentration, and there was no overall clear evidence that normalized replicate means did not follow a symmetric distribution.These findings were deemed acceptable for using the unweighted OLS regression analysis.Count data from assay endpoints were considered continuous as counts were well above zero.

BMC and its uncertainty
In the standard protocol the BMC was estimated directly from the best-fit model.We considered model averaging as an alternative option where, similar to the previous best-fit method, a number of suitable concentration-response models are fitted to the same data, but in this case, all resulting model fits were combined to provide a weighted average of BMC estimates (Ritz et al., 2013).Uncertainty was always expressed as α = 5%, i.e., the lower limit (BLL) corresponds to the 2.5% limit and the upper limit (BUL) to the 97.5% limit.BLL and BUL were derived by three different methods, i.e., inverse regression, the delta method, and bootstrapping.The estimation of the BMC and its 95% CI by model averaging was always performed in combination with bootstrapping.Inverse regression was used after having fitted the regression model to the data (Buckley et al., 2009;Fang et al., 2015).Here we used a simple, pragmatic approach by anchoring both BLL and BUL directly to the 95% CI of the regres-directionality of the observed concentration-response pattern (i.e., either reduction or inhibition).Common to all decision trees is that they compare the BMC of the DNT-specific endpoints to the respective BMC of the unspecific endpoint (i.e., cytotoxicity or cell viability).For the NPC and UKN assays slightly different versions were developed, with all NPC assay endpoints accounting directly for the statistical uncertainties of both BMC estimations by using their corresponding CIs, and all UKN assay endpoints using pe-defined acceptance ranges instead.The principles of the hazard decision tree for data sets with a decreasing concentration-response pattern (reduction) measured in NPC assays (NPC1, NPC2, NPC3 and NPC5, Tab.S1 1 ) are shown in Figure 2, and for an increasing concentration-response pattern (induction) in Figure 3. Inductions are handled separately because the specific and unspecific endpoints do not have the same relationship during an induction compared to a reduction in the endpoint.A loss in general cell viability, for example, will likely result in an effect on cell proliferation, while an induction in cell viability does not necessarily increase cell proliferation.If migration (NPC2a) is affected, only cytotoxicity is used as a reference for all specific endpoints of NPC2-5.A reduction in migration also reduces cell viability due to the lower number of cells in the migration area and not necessarily due to cell death.If so, it cannot be used as a valid reference to discriminate between a specific and unspecific ef-ity and cytotoxicity).Accordingly, the endpoint-specific hazard of a compound is classified into five categories: − No hit: There is no observed effect on the DNT-specific endpoint or on general cell health.− Unspecific hit: The effect on the DNT-specific endpoint cannot be separated from an effect on the cell health related endpoint.− Borderline hit: The separation between the effects on the DNT-specific endpoint and the effect on the cell health related endpoint is statistically not clear (Leontaridou et al., 2017).− Specific hit: The effect on the DNT-specific endpoint is clearly separated from an effect on the cell health related endpoint.− Not identified: Data are incomplete und do not allow classification.If the automatic classification failed due to a high uncertainty of the BMC or a missing BMC for the cell health related endpoint, the classification was recorded as "expert judgement" and classification into one of these five categories was done by manual inspection according to a guidance document (see Section 1.4 and 1.5 in the supplementary file 1 for more details).Each classification was done independently by two experts with a high interrater reliability (Kappa statistic > 0.9).An overview over all flagging alerts leading to expert judgement is given in Table S5 1 .The guidance used for expert judgement is given in Figure S1 1 .
The hazard classification approach was operationalized by hazard decision trees that reflect specific assay features and the

Fig. 2: Decision tree for the NPC hazard classification of inhibitory effects
The decision tree shows for NPC1-5 data with a decreasing concentration-response pattern how BMC estimations and their uncertainty (expressed as 95% CI) for data from both specific and unspecific endpoints are used to classify the compound into one of the DNT hit categories (colored boxes).Hits with the category "expert judgement" (grey box) will be classified into one of the DNT hit categories by manual inspection on the basis of all data evidence.
A positive compound was considered as true positive if it was classified as a specific hit or borderline in at least one assay.

Results
The impact of different statistical methods was quantified by comparing outcomes of the standard protocol with those of the following alternatives: 1) average replicate per experiment estimated by the arithmetic mean, 2) control normalization without re-normalization, 3) using a three-parameter log-logistic regression model (LL3rm) for BMC estimation, 4) using model-averaging for BMC estimation, 5) CI estimation of the BMC using the delta method, 6) CI estimation of the BMC by bootstrapping, 7) CI estimation of the BMC by model averaging, and 8) increasing the endpoint-specific BMR by 20% (BMR30 and 50).Differences in the BMC estimation and its lower 95% CI, the endpoint-specific hazard classification of the compound, and assay performance were quantified and compared across the various specific assay endpoints.
In total, 148 compounds were tested on up to 14 DNT-specific and 8 cytotoxicity and viability endpoints, leading to a total of 2426 datasets of which 2385 (98.31%) fulfilled the minimal data requirements of the data evaluation pipeline.According to the standard protocol, it was possible to perform a regression analysis for 2385 data sets (1953 NPC and 432 UKN) and a hazard hit categorization for 1563 data sets from DNT-specific endpoints (1347 NPC and 216 UKN).Three-parameter models were suggested as best-fit choice for 55.9% of these data sets, the exponential function with two model parameters for 31.7%,fect.The same applies to effects on cell viability.In these cases, only cytotoxicity is used as general cell health reference for respective specific NPC endpoints.More details can be found in the supplementary material (Section 1.3 1 ) and in Table S4 1 , and details on the classification tree applied to data from the UKN assays can be found in Masjosthusmann et al. (2020).The lower and upper confidence interval of the BMC always refers to the central two-sided 95% confidence interval (BLL = 2.5 th percentile, BUL = 97.5 th percentile).

Assay performance
Of the 148 compounds tested in the DNT IVB, a set of 45 reference compounds (17 negative compounds that are known not to cause DNT; 28 positive compounds with proven DNT adversity in humans or mammals) was used for an evaluation of the DNT IVB predictivity.Hit decisions were derived from the hazard decision trees developed in 2.2.8, and the following performance parameters were used: Specificity = # true negative hits (Eq.5) # negative compounds Sensitivity = # true positive hit (Eq.6) # positive compounds Accuracy = # true negative hits + # true positive hits (Eq.7)

# negative compounds+ # positive compounds)
A negative compound was considered as true negative if it was not classified as a specific hit or borderline in any of the assays.

Fig. 3: Decision tree for NPC hazard classification of increasing effects
The decision tree shows for NPC1-5 data with increasing concentration-response pattern ("induction") how BMC estimations and their uncertainty (expressed as 95% confidence intervals, CI) for data from both specific and unspecific endpoints are used to classify the compound into a specific or no hit (colored boxes).The presence of a cytotoxic response can lead to an artefact in the DNT-specific endpoint and is therefore initially categorized as "expert judgement".These hits will be classified into one of the DNT hit categories by manual inspection on the basis of all data evidence.

Fig. 4: Impact of methodological changes in data evaluation on the BMC estimation
BMCs for 148 compounds tested on up to 22 endpoints from 8 assays were estimated using the standard protocol and alternative methods.(A-E) A relative BMC was expressed as the log10-transformed ratio between the 100-fold BMC and the maximum test concentration, and relative BMCs from all data sets and endpoints but different statistical methods were plotted against each other.The solid black line of identity indicates no difference between the relative BMCs, the grey interval around the line of identity indicates values within a three-fold range.Values outside this interval are considered as relevantly different between the methods.If a relative BMC could be calculated for only one method, the missing value of the opposing method is plotted as BMRnr area on the right or upper side of the graph.Relative BMCs are colored according to their bioassay endpoint.To indicate the strength of agreement between both data evaluation protocols, the goodness-of-fit coefficient from a trend regression analysis between both relative BMCs is included (R 2 ; top left), and the percentage of successfully applied regression models of the alternative protocol in relation to the standard protocol is shown top left ("fit success rate").el, which led to a noteworthy loss of successful regression fits (68.6% success rate).
To further explore differences between BMC estimates, the number of BMRnr cases that only occurred in the alternative protocol (i.e., the standard protocol resulted in a BMC while the alternative protocol did not; Fig. 4F, blue-shaded area of bar), the number of BMRnr cases that only occurred in the standard protocol (Fig. 4F, green-shaded area of bar), and large differences outside the belt ("outliers", Fig. 4F red shaded area of bar) were compared to the total number of BMCs that were estimated using the standard protocol.Most protocols that led to less successful BMCs were caused by the inability of the data to support the regression modelling for the intended BMR level.All alternative protocols together produced fewer BMCs but more BMRnr cases, with protocol changes to control normalization and higher BMRs resulting in the highest increase in BMnr cases (i.e., fewer BMCs), with an increase of 17.25% and 37.98% BMRnr cases, respectively.Taking only the cases with huge BMC differences into account ("outliers"), the number of BMCs that were either lost or gained due to the protocol change was further quantified: four-parameter models for 10.5%, and the most complex model (5-parameter general log-logistic) for 1.9%.

Impact of different data evaluation methods on the BMC estimation
The relative BMCs from the standard and alternative statistical protocols are shown in Figure 4A-E for five statistical parameters that were changed, with the BMC of the alternative protocol always referring to the x-axis and the BMC of the standard protocol to the y-axis.If a regression analysis could be performed but a BMC could not be established due to missing data support for the BMR, the BMC was flagged as "BMRnr" (BMR not reached) and included in the plot at the end of the BMR axes, i.e., a BM-Rnr value on the right side of the plot indicates a BMC estimation that was only possible using the standard protocol, and similarly, a BMC value at the top of the plot area indicates a BMC estimation that could only be established for the alternative protocol.Data sets for which none of the protocols were able to produce a BMC were excluded.Color-coded symbols refer to the 22 bioassay endpoints, and a data point on (or close to) the solid line of identity indicates perfect agreement between the BMCs from both protocols.Three-fold BMC differences are highlighted by a belt around the line of identity (i.e., values outside of the belt have above three-fold change), and the percentage number of successful regression fits for the alternative protocol is included on top of each plot, with reference to the 1953 data sets for which successful regression modelling was conducted according to the standard protocol.To identify general deviation patterns, we performed trend regression analyses between the relative BMCs, and the corresponding value of the goodness-of-fit criterion (R 2 ) is provided in the plot: the higher the coefficient, the more consistent the results between the two protocols.For the trend analysis, we set a relative BMC = 2.47 for a BMRnr, i.e., a 3-fold difference between the highest concentration and a fictional BMC was assumed.Not shown are BMC differences for the bootstrapping and delta method, as both refer to the same BMC and thus would have resulted always in identical BMCs in the plot.
We found the most profound BMC differences between the data re-normalization and control normalization (Fig. 4B), with an R 2 of 0.3.The main reason for the huge number of BMC disagreements is due to the high number of BMRnr's, i.e., regression fits that could establish a reliable BMC for the endpoint-specific BMR in only one of the protocols.Using the mean as replicate average instead of the median (Fig. 4A), using a predefined regression model (LL3rm) instead of the best-fit method (Fig. 4C), and using a higher BMR resulted in moderate BMC changes, with R 2 values between 0.59-0.61.The best agreement between relative BMC values was observed for the comparison between the outcomes from model averaging against the best-fit method (Fig. 4D) with an R 2 of 0.85.The number of datasets for which a regression model could be fitted for the alternative protocol was related to the number of fits for the standard protocol and expressed as relative "fit success rate".All changes of statistical methods led to similar success rates, with the exception of the sole application of the three-parameter log-logistic mod-

Fig. 5: Impact of methodological changes on the BMCL for DNT-specific assay endpoints
BMCLs for 148 compounds tested on up to 14 DNT-specific endpoints from 8 assays were estimated using the standard protocol and different alternative methods.BMCL differences are expressed as ratio (black dots), with variation shown by box and whisker plot (box = median ± interquartile range, whisker = 5-95 th percentiles) 1), we judged this method as too unreliable and thus excluded it from all remaining analyses.
Exemplary data sets are shown for three different classification scenarios: (i) a specific DNT hit decision for significantly inhibited oligodendrocyte differentiation at exposure concentrations above 0.25 µM, and only marginally reduced cell viability (marker for cytotoxicity) at 20-fold higher concentrations (Fig. 6A), (ii) an unspecific hit decision for significantly inhibited oligodendrocyte differentiation and cytotoxicity observed at the same concentration range (0.24 to 2.2 µM) (Fig. 6B), and (iii) a data scenario that was flagged for expert judgement because for the specific endpoint a weak but statistically significant effect reduction at the highest test concentration (20 µM) was observed which was not supported by a reliable BMC10 estimation.On closer inspection of the experimental data (Fig. 6C, with each color-coded symbol representing the replicate median from an independent experiment), it was decided that responses from both the specific and unspecific endpoint were not distinguishable, and thus the weak response reduction of the specific endpoint was classified as unspecific.
Figure 6D provides an overview of the total number of hit classifications that changed in response to changes to the standard protocol.Expressed as percentages and for each methodological change, the changes in hit classifications are further divided into "gains", i.e., percent increase in hazard hits in relation to the standard protocol, and "losses", i.e., percent decrease in hazard hits in relation to the standard protocol.Here, a change toward replicate averaging by mean, control normalization and bootstrapping caused the lowest number of classification changes (< 5%), followed by methodological changes towards model averaging or higher BMR levels, which led to almost 7% different hit classifications.Here, model averaging increased the number of "not identified" classifications by 2.6%, mostly at the cost of "no hit" classifications, and the higher BMR levels led to 4.9% more "no hit" classifications.The by far most severe changes in hit classifications were observed when only the LL3rm regression model was used to describe the experimental concentration-response data (45.9%total difference), which led to 42.0% more "not identified" classifications.The latter is most likely the consequence of unsuccessful regression modelling (and corresponding BMC estimation) due to lack of sufficient data support for this model.

Assay performance
To assess how changes in the data evaluation protocol might impact the evaluation of the DNT IVB's predictivity, 28 reference chemicals of known DNT and 17 negative control chemicals were selected (Masjosthusmann et al., 2020;Blum et al., 2022).All 45 substances were tested in the DNT IVB, and the overall performance of the DNT IVB was quantified by its specificity, sensitivity, and accuracy.Outcomes are shown for the standard protocol as well as all relevant changes in Figure 7: (i) Specificity (Fig. 7A): standard protocol and changes to it always had a specificity between 87.5% and 100%, i.e., truly DNT negative substances were almost always also judged as negative by the Model averaging led to the smallest number of relevant changes (7.13%), followed by replicate averaging by mean, fixed regression model (LL3rm) and control-normalization with moderate changes (13.43%-28.86%), up to > 40% changes if a higher BMR was used.More details on how changes to the standard evaluation protocol influenced the overall uncertainty of the BMC estimation are provided in the supplementary results (Fig. S2 1 ).
The impact of each methodological change on the BMCL (lower 5% confidence interval of the BMC) was assessed as the ratio between the BMCLs from the alternative protocols to the standard protocol, with the ratio distributions summarized in Figure 5.With the exception of the delta approximation and an increased BMR, the methodological changes provided on average no systematic deviation towards larger or smaller BMCLs, with the occurrence of 10 times larger or smaller BMCLs well below 1%.The delta approximation led in more than 90% of all data sets (N = 409) to a larger BMCL (median = 1.42,P95 = 271.4),and increasing the BMR by 20% (BMR30 and 50) led for 99.5% of all data sets to a smaller BMCL.The latter indicates a higher statistical certainty of the BMC estimate with increasing BMR, however at the cost of fewer data sets for which sufficient data support was given to allow a BMC estimation (Fig. 4E).

Method impact on hazard classification
An important application of the BMC estimation is the endpoint-specific hazard classification of a test compound into one of five hit categories, i.e., if the compound produced sufficient data evidence to be judged as a DNT-specific hit, borderline hit, unspecific hit, no hit, or as not identifiable (due to missing data support).Although all decision trees were set up as automatic systems, some data scenarios provided insufficient data and were flagged for "expert judgement".The number of data scenarios for which the hazard classification was performed by expert judgement are listed in Table 1 for the standard protocol and for the seven methodological changes, divided according to the main decision trees developed for data from NPC or UKN assays.In total, 1563 classifications were conducted (NPC: 1347, UKN: 216), of which 68 (NPC: 30, UKN: 38) were flagged for expert judgement according to the standard protocol.All protocol changes led to similar numbers, with the exception of the delta method applied to data outcomes from NPC assay endpoints, which required expert input for three times as many classifications.A marked difference was observed between the decision trees for NPC and UKN assay endpoints, with up to 5 times more classifications flagged for expert judgement for UKN outcomes depending on the statistical method chosen.The main reason for this discrepancy is how the two classification trees dealt with data sets where the highest effect responses from an unspecific endpoint were below the BMR: these were always marked for an expert judgement in the UKN classification tree but more thoroughly checked by automatized algorithms in the NPC classification model before being flagged for expert judgement (Fig. 2, 3).
Due to the poor performance of the delta method regarding BMC uncertainty (Fig. 5) and the consequence of a more likely expert intervention in the automatic hazard classification (Tab.For each methodological change to the standard protocol, the number of hit changes is expressed as percentage of the total number of hit classifications, divided into "gains" (i.e., the percentual increase of hazard hits in relation to the standard protocol) and "losses" (i.e., the percentual decrease of hazard hits in relation to the standard protocol).Different bar segments represent the different classification categories.
pothesis that the selection of common biostatistical concentration-response methods can affect the performance of the DNT IVB, with our focus on BMC estimation, DNT hit classification, and its overall predictive performance for DNT adversity.All study outcomes are discussed at given experimental design, and data that is a data set is considered as pooled data from at least three independent experiments.

Experimental mean or median replicate
Altogether, our study outcomes strengthen the argument for using the median as average response per test concentration and experiment (replicate average) as statistical unit in the concentration-response regression analysis.The alternative use of the arithmetic mean led on average only to minor changes in BMC estimations and hazard classification outcomes, as most data sets had either no outliers or outliers occurred at concentrations that had only little influence on the regression analysis.Nevertheless, for a few data sets changing from median to mean had a strong influence on the best-fit regression analysis such that, in the worst case, the subsequent BMC estimation was not possible ("BMRnr") and performance parameters (specificity, sensitivity, accuracy) for identifying DNT adversity dropped.If effect data follow a symmetric distribution (or can be transformed accordingly) and are not affected by an outlier, the sample mean is known to be more efficient than the median, e.g., for normally distributed data the relative standard error of the median is ca.25% greater than the standard error of the mean (Maindonald et al., 2010).However, to protect the mean against an outlier would not only require outlier detection methods, which are per se problematic for assay endpoints of relatively high within-ex-DNT IVB, and the standard protocol seems to be robust against methodological changes in judging false negatives.(ii) Sensitivity (Fig. 7B): 23 of the 28 DNT substances (82.1%) were successfully identified by the DNT IVB when the standard protocol was used, but changes to it decreased the sensitivity.(iii) Accuracy (Fig. 7C): The best performance was achieved by the standard protocol (88.6%), followed by a methodological change to bootstrapping (86.4%), higher BMR levels (84.1%), mean replicates, control normalization, pre-defined regression model (all 81.8%), and model averaging (77.3%).The latter performed 11.3% below the accuracy value of the standard protocol.A detailed overview over the hit definition of all control compounds is given in Section 2.1 of the supplementary file (Tab.S6-S8 1 ).

Discussion
The basis for this biostatistical study is a compound screening project performed on behalf of an EFSA procurement during the years 2017-2020 (OC/EFSA/PRAS/2017/01).Twelve DNT in vitro test methods with accompanying cytotoxicity and viability assays belonging to an OECD DNT IVB (Crofton and Mundy, 2021) were challenged with 148 compounds from different compound classes including expected negative control compounds (Masjosthusmann et al., 2020;Blum et al., 2022).DNT-specific assay endpoints assess biologically complex systems like changes in key processes in brain development over time, and therefore their test outcomes are more diverse in terms of data variability and concentration-response pattern than typically observed for data from simple reporter gene assays.Here we tested the hy- the data in the best possible way, with their number of model parameters mirroring different degrees of data complexity, and the model with the lowest number of parameters favored over a more complex model when it can describe the data almost as accurately as the more complex model (parsimony).The number of different concentration-response functions selected as best-fit model suggests that no individual regression function can describe every possible data scenario, as demonstrated for the example of the 3-parameter log-logistic function (LL3, Tab.S2 1 ), a reparametrized form of the Hill function (but without the log10 transformation of the concentration term), which is often used as the standard regression model in pharmacology and toxicology.In line with theoretical expectations and previously reported simulation studies (Zhu et al., 2007;West et al., 2012;Piegorsch et al., 2013), the best-fit model approach responded more flexibly to data sets and therefore often resulted in BMC estimations that differed significantly from those derived by this pre-fixed single model.
For 1 out of 3 data sets, the two-parameter exponential function was chosen as best-fit model, indicating that a relatively low model complexity was necessary to describe the observed data pattern.This typically applied to DNT data sets where only the two highest concentrations had produced significant effects.Therefore, the simplest mathematical functions (e.g., exponential and linear) should always be included in the pool of candidate models to ensure a BMC estimation also for data sets with only limited data support for regression modelling.It should be noted that the best-fit approach is purely data-driven, and therefore different regression models can be chosen for the same assay endpoint, but as a data set is defined as pooled from independently repeated experiments, it avoids that different models are used for the same test compound and bioassay endpoint.
Model averaging is historically motivated by the typically low number of doses in animal studies that can provide meaningful data for regression modeling and the subsequent problem that different regression models can describe the observed dose-response data equally well but interpolation into a dose region with little or no data may result in very different response (and BMD) estimates (EFSA, 2017).A statistical argument in favor of model averaging is that uncertainty of the model selection process of the best-fit method is not incorporated in the BMD and associated BMDL estimation (West et al., 2012).Our study shows no big differences between both methods, and we attribute the higher number of failed BMC estimates for model averaging (Fig. 4D) to the fact that the simple exponential function was excluded from the pool of candidate models for model averaging but proved to be superior for data sets where two or fewer concentrations induced significant but often weak assay responses.Based on statistical reasoning, model averaging should be the preferred approach, however, the corresponding inference can only be expressed either by a Wald type of confidence interval, which can produce a negative value for the BMDL, or by using computer-intensive parametric bootstrap (Aerts et al., 2020), which in our study failed too often for "poor" data scenarios.
perimental variability (such as DNT IVB) and small sample sizes (typically N = 3), but also a decision on how to handle these values in further data analyses (e.g., removing, winsorization, or trimming).The median provides a simple way to circumvent these complex statistical decisions and can thus be considered an ideal choice to enhance the robustness of an automatic data evaluation pipeline.

Data normalization to control and re-normalization
The normalization of effect data to the outcomes of test concentrations can be a viable option to safeguard against an ill-defined negative control reference and therefore avoid a biased BMC estimation and incorrect hazard alerts.The re-normalization of response data should be done whenever sufficient data evidence is provided for non-exposure related effect responses at lowest concentrations which have been confirmed by independent experiments.If not justified, an existing exposure effect can be judged wrongly as a technical or biological artifact and misused as zero effect response in the statistical concentration-response analysis.Although a random explanation is theoretically possible, e.g., the control readouts were not representative and only rare "unlucky" outcomes, the confirmation by independent experiments points to a non-random, compound-specific cause.However, reasons for this phenomenon are unclear, with biological effects and technical issues discussed (Krebs et al., 2018), and it is unknown whether this is a phenomenon specific to the DNT field or functional endpoints.
The experimental design for the assays from the DNT IVB was chosen such that the lowest test concentrations were expected to produce no treatment-related effect responses, and for more than 90% of all data sets the three lowest concentrations provided non-distinguishable effect responses.This and the frequent occurrence of misleading negative control responses for some of the assay endpoints was deemed sufficient for using the control re-normalization on all data sets (as part of the standard protocol).We cannot fully rule out that it was wrongly used for some data sets, and more robust decision rules are required for assuring successful data re-normalization.The criteria we recommend are (i) a certain minimum magnitude between the original and re-defined control references based on statistical reasoning (e.g., outside the detection limit), (ii) a minimum number of test low concentrations for which the effect responses provide no indications for a positive or negative trend (e.g., N = 3), and (iii) a minimum concentration range at which no effect can be judged (e.g., > factor 10).However, in case a control normalization can neither be judged by an automatic ruling system nor by a human expert, we suggest a repetition of the experiment at lower test concentrations.

Concentration-response model and BMC estimation
The biological complexity of DNT assay endpoints rules out a unique mechanistic model that would allow a 100% accurate quantitative representation of every possible shape of a concentration-response pattern.Therefore, various mathematical candidate functions were used as empirical models to describe between the experimental replicate medians always follows the Gaussian distribution, we expect this to be the case in less than 1% of cases.

Hazard identification
An endpoint-driven hazard classification method is essential for a reliable identification of hazard alerts, and DNT-specific endpoints should always take general cell health into account.However, there is neither a clear biological rationale on how to differentiate DNT cell functionality from general cell health nor a universally agreed quantitative approach on how to distinguish cytotoxic from DNT relevant concentrations.Our proposed hazard classification method puts a strong emphasis on cell health before declaring a test compound a specific DNT hit by allowing only a small fraction of the central 95% CI of the DNT BMC to be overlapped by the central 95% CI of the BMC for cell health (Fig. 2).Here a BUL corresponds to a 2.5 th percentile of the central 95% CI, which for cell health data sets with a relatively high data variability could mean more than a factor of 10 between the BMC and its BUL.This rigorous way of accounting for statistical uncertainty in the BMC estimates has the consequence that test compounds with weak DNT activity (which are often close to cytotoxic concentration ranges) are more likely to be classified as "borderline" hits.This should not be confused with unspecific hits, especially as mildly affected cell viability does not necessarily have a measurable impact on cell functionality.Nevertheless, defining the desired degree of statistical uncertainty in hazard classification methods requires a general acceptance in the DNT field, and other approaches might be viable in the same way.

Conclusion
Our study on 148 compounds tested in a large number of DNT in vitro assays demonstrates that statistical decisions that seem to be of minor importance can become decisive when it comes to hazard classification of a test substance.Although this study was conducted on concentration-response data from only the DNT IVB, we think many of the conclusions can be generalized to data from other specific toxicological endpoints, especially in the rising field of organotypic/stem cell-based cultures.In our experience, the proposed standard protocol for an automated data evaluation pipeline is the best compromise between the various statistical methods without "overcomplicating" the regression analysis and the corresponding BMC estimation, but we also acknowledge that the selected methods are not necessarily optimal for every data set.The drawback of an automated analysis is always the danger of not being prepared to deal with an unexpected data set, a scenario that can only be avoided by a case-by-case expert analysis.The strength of our data evaluation platform is the integration of endpoint-specific hazard classifications, including flagging systems for uncertain cases, which to our knowledge is novel.We consider it crucial for the hazard assessment to differentiate between general cell toxicity and specific DNT hits.
We used various statistical methods implemented in the drc and bmd R package to derive a confidence belt around the BMC, which is required for the BMCL (5 th percentile of the lower one-sided CI) or, in case of the hazard classification, the BLL and BUL (2.5 th and 97.5 th percentile of the two-sided CI).It should be noted that these methods (delta approximation, inverse regression, resampling methods) do not change the BMC estimation but try to calculate the uncertainty of the BMC estimation from the estimated regression model and experimental data.All methods have their pros and cons with different requirements for the data and regression models, and none of them can be ruled out a priori as inappropriate for the BMC estimation of a DNT IVB data set.Only further computer-intensive simulation studies could reveal the potential bias and coverage of the estimators at given data and model complexity, but this was not the aim of our comparative study.Nevertheless, our results show that simple common statistical methods such as the delta method do not necessarily guarantee a reliable estimation of the BMC confidence, and more sophisticated methods such as resampling require sufficient data support, which is often not given by the experiments.For instance, the bootstrap method puts a strong emphasis on the representativeness of the original data set from which samples are drawn repeatedly (virtual data sets), and if violated, can be prone to a biased interval estimation (i.e., mode of the resampled BMC distribution differs from the original BMC estimation), or in the worst-case, leads to an interval that hardly mirrors the observed data variability.Typically, DNT IVB endpoints are characterized by a relatively high between-study variability (illustrated by the BMRs, Tab.S1 1 ), small sample sizes (3-5 experimental medians), and a small number of test concentrations at which significant responses are observed.When all these data characteristics came together, regression resampling had a high chance for failure, independent of whether non-parametric, residual or parametric sampling was used.Until generally applicable decision rules about the minimal data requirements for bootstrapping can be implemented in an automatic data evaluation platform, it is difficult for the non-expert to make decisions about the usefulness of resampling for a particular data scenario, especially as bootstrapping can be performed in various ways.

BMR selection
The BMR level should be chosen as close as possible to the control level without compromising the statistical concentration-response analysis.Setting it too high (e.g., 50% for an IC50) involves the danger of overlooking hazard responses, which can lead to erroneous hazard hit classifications.We used the 1.5 sigma rule for the selection of the BMR, with sigma estimated as standard deviation from the between-experimental variation from a large set of historical data sets (Masjosthusmann et al., 2020).For a sample size of 3-5 independent experiments, we expected the estimation of a BMC for most data sets if a true BMC was present in the data, but an unexpected high data variability might have contradicted the 1.5 sigma rule such that the BMR was too low for a BMC estimation.If the scatter Fig. 4: Impact of methodological changes in data evaluation on the BMC estimationBMCs for 148 compounds tested on up to 22 endpoints from 8 assays were estimated using the standard protocol and alternative methods.(A-E) A relative BMC was expressed as the log10-transformed ratio between the 100-fold BMC and the maximum test concentration, and relative BMCs from all data sets and endpoints but different statistical methods were plotted against each other.The solid black line of identity indicates no difference between the relative BMCs, the grey interval around the line of identity indicates values within a three-fold range.Values outside this interval are considered as relevantly different between the methods.If a relative BMC could be calculated for only one method, the missing value of the opposing method is plotted as BMRnr area on the right or upper side of the graph.Relative BMCs are colored according to their bioassay endpoint.To indicate the strength of agreement between both data evaluation protocols, the goodness-of-fit coefficient from a trend regression analysis between both relative BMCs is included (R 2 ; top left), and the percentage of successfully applied regression models of the alternative protocol in relation to the standard protocol is shown top left ("fit success rate").(A) Experimental median replicates versus mean replicates (n = 568).(B) Re-normalized data versus controlnormalized data (n = 630).(C) Best-fit approach versus a predefined three parameter log-logistic regression model (n = 520).(D) Inverse regression versus model averaging (n = 604).(E) BMR10+30 (BMR10+25 for UKN) versus BMR30+50 (n = 604).(F) Percentage of all data sets for which the protocol change led to a BMC change in terms of BMRnr (i.e., a BMC could not be determined from the regression fit) or an above three-fold BMC change.

Fig. 6 :
Fig. 6: Number of endpoint-specific DNT hit classification changes in response to changes in the standard data evaluation protocol (A-C) Exemplary data sets for three different classification scenarios: concentration-response data from the specific (blue) and unspecific (black) endpoints are from 5 independent experiments, with effect responses re-normalized to the regression estimate at lowest test concentration and summarized as mean ± SEM.Horizontal lines indicate the BMR levels for the BMC estimation, where straight lines indicate the specific endpoint BMR and dotted lines the unspecific endpoint BMR (if they differ).Data were always analyzed according to the standard data evaluation protocol.(A) Specific hit: the specific endpoint (oligodendrocyte differentiation) is impacted at non-toxic concentrations.(B) Unspecific hit: inhibition of oligodendrocyte differentiation and cell viability are observed at similar concentration ranges.(C) Hit classification by expert judgement: an automatic hit classification was prevented by ambiguous data but judged as "unspecific" by experts.(D)For each methodological change to the standard protocol, the number of hit changes is expressed as percentage of the total number of hit classifications, divided into "gains" (i.e., the percentual increase of hazard hits in relation to the standard protocol) and "losses" (i.e., the percentual decrease of hazard hits in relation to the standard protocol).Different bar segments represent the different classification categories.

Fig. 7 :
Fig. 7: Evaluation of the predictive performance of the DNT IVB based on the standard data evaluation protocol and changes Bar graphs show the results of the predictive capability of the DNT IVB for 28 substances of known DNT and 17 negative control substances in terms of specificity, sensitivity, and accuracy.

Tab. 1: Number of endpoint-specific DNT hit classifications judged by experts The
numbers of hit classifications by expert judgement are presented as percentage of all classifications that were supervised by the hazard decision trees.
a NPC, Data outcomes from NPC assays; b UKN, Data outcomes from UKN assays