The Impact of Precision Uncertainty on Predictive Accuracy Metrics of Non-Animal Testing Methods

The ability of non-animal methods to correctly predict the outcome of in vivo testing in repeated applications is referred to as precision. Because continuous read-outs are dichotomised into discrete "positive/negative" hazard data, non-animal methods can reveal discordant classifications if results are sufficiently close to a defined classification threshold. This paper explores the impact of precision uncertainty on the predictive accuracy of non-animal methods. Using selected non-animal methods for assessing skin sensitisation hazard as case study examples, we explore the impact of precision uncertainty separately and in combination with uncertainty due to varying composition and size of experimental samples. Our results underline that discrete numbers on a non-animal method's sensitivity, specificity, and concordance are of limited value for evaluating its predictivity. Instead, information on the variability and the upper and lower limits of accuracy metrics should be provided to ensure a transparent assessment of a testing method's predictivity, and to allow for a meaningful comparison of the predictivity of a non-animal method with that of an animal test.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is appropriately cited.


Introduction
Although animal studies do not accurately predict human toxicity (Russell and Burch, 1959), they are still regarded as the "gold standard", and results of so-called "alternative methods" (also termed "non-animal methods" or "new approach methods") are compared to results of animal studies. In rare cases, sufficient data from humans are available to serve as a reference. The degree of agreement between results obtained by in vivo and non-animal methods is quantified by accuracy metrics. Several studies have pointed to possible biases in non-animal testing methods' accuracy metrics due to inter- and intra-laboratory variability, which can impact non-animal testing methods' reliability (Weil and Scala, 1971; Agnese et al., 1984; Margolin et al., 1984; Worth and Cronin, 2001a; Hothorn, 2002 and 2003). The analysis of inter-laboratory variability, i.e. the reproducibility of test results across different independent laboratories, has received wide attention in the scientific literature (Sakaguchi et al., 2010; Sirota et al., 2014). Though the relevance of intra-laboratory variability, i.e. the ability of in vivo tests or non-animal methods to reproduce their outcomes in repeated experimental applications, has been recognised (Bruner et al., 1996), systematic analyses of intra-laboratory variability have not been taken up for a long time.
Only recently have several studies assessing the intra-laboratory reproducibility of testing methods used for skin sensitisation hazard classification become available. For the gold standard animal test, the local lymph node assay (LLNA), Dumont et al. (2016), Hoffmann (2015) and Kolle et al. (2013) analysed the variability of classifications caused by a dichotomisation of continuous read-outs into discrete "positive/negative" data. In particular, Kolle et al. (2013) determined a range around the classification threshold within which the LLNA reveals discordant results in repeated applications. This range has been called the "grey zone" (Dimitrov et al., 2016) or "borderline range" (BR) (Kolle et al., 2013). The studies concluded that substances yielding test results within the BR can be classified neither positive nor negative, but must be reported 'non-conclusive' or 'ambiguous'. This limits the LLNA's precision, i.e. its ability to reveal concordant results in repeated applications. Leontaridou et al. (2017) quantified the BR for the LLNA and selected non-animal testing methods, i.e. the Direct Peptide Reactivity Assay (DPRA) (Gerberick et al., 2004, 2007), the ARE-Nrf2 luciferase method LuSens (Ramirez et al., 2014, 2016), and the human cell line activation test (h-CLAT) (Ashikaga et al., 2006, 2010; Sakaguchi et al., 2006, 2010). Furthermore, their study determined borderline substances in the experimental sample tested with the "2 out of 3" integrated testing strategy (ITS), which is a combination of the aforementioned non-animal testing methods (Bauch et al., 2012; Urbisch et al., 2015). Their analysis showed that the number and percentage of substances considered borderline can be significant and depends on the relationship between the size of the BR and the substances yielding results within this range.
Clearly, substances with ambiguous hazard classification cannot contribute to determining a testing method's predictive accuracy. Consequently, ignoring the BR in a testing method's prediction model, and hence precision uncertainty, may bias the assessment of classification accuracy. So far, however, the size and direction of this bias have not been explored systematically.
The aim of this paper is, therefore, to examine the impact of precision uncertainty on sensitivity, specificity and concordance. For our analysis we use non-animal testing methods for skin sensitisation hazard classification, i.e. the DPRA, LuSens and the h-CLAT, as case study examples. Since standalone non-animal testing methods are considered not to provide sufficient information for hazard classification (ECHA, 2016; Mehling et al., 2012; Reisinger et al., 2015), we also include the "2 out of 3" ITS in the analysis. Skin sensitisation hazard seems to be a particularly suitable endpoint because it is required for safety evaluations of new and existing substances in different regulatory frameworks of the European Union (e.g. the REACH legislation (EC, 2006) and the Cosmetics Regulation (EC, 2009)). Furthermore, both human and animal data are available as a reference, and several non-animal methods and testing strategies are already in regulatory use (Kleinstreuer et al., 2018; Sauer et al., 2016).
Accuracy metrics are determined in relation to the LLNA and to human reference data. We analyse the impact of precision uncertainty on accuracy metrics of the selected non-animal methods while disregarding the limited precision of the in vivo test. While this may be a simplification, it reflects the postulates underlying the validation process of non-animal methods.
For our analysis we adopted a step-wise approach. First, we compared sensitivity, specificity and concordance derived from experimental samples including borderline substances (in the following called "the entire sample") with accuracy values obtained after borderline substances were excluded (in the following called "the adapted sample"). Besides the specification of the classification threshold, testing methods' accuracy depends on the size and composition of the experimental set of substances. These can vary considerably across testing methods, depending on the availability of robust in vivo and, if applicable, human data, a balanced representation of sensitisation potency classes, and a balanced number of sensitisers and non-sensitisers (ECVAM, 2012, 2013). Furthermore, there is no defined minimum number of substances below which an experimental sample would be considered insufficient for robust evaluations of non-animal testing methods' predictive accuracy. Recently, the OECD has defined performance standards for the ARE-Nrf2 luciferase method, recommending a minimum list of 20 reference chemicals that should be used for assessing the testing method's predictive capacity and reproducibility (OECD, 2015b; Kolle et al., 2019).
Therefore, second, we examined the impact of precision uncertainty on non-animal testing methods' accuracy in combination with uncertainty due to variations of sample size. Using non-parametric bootstrapping analysis (Wehrens et al., 2000), we created randomised experimental samples for every non-animal testing method considered. For each experimental sample this generated distributions of sensitivity, specificity and concordance. We quantified the mean, the standard deviation, and the 95% confidence limits for all accuracy metrics, and compared mean accuracy metrics to those obtained from the sample of substances used in validation studies. Moreover, we examined the joint impact of precision uncertainty and varying sample composition by comparing accuracy metrics from randomised full samples (i.e. including borderline substances) with those retrieved from adapted samples (i.e. excluding borderline substances). Third, we analysed the joint impacts of sample composition, sample size, and precision uncertainty on the classification accuracy of the selected non-animal methods and the ITS.

2
Materials and methods

Non-animal testing methods
Responding to the urgent need to minimise animal testing, several non-animal testing methods and integrated testing strategies for assessing skin sensitisation hazard have been developed (Mehling et al., 2012; Reisinger et al., 2015; Urbisch et al., 2015). The methods used for our analysis are briefly described below.

The Direct Peptide Reactivity Assay (DPRA)
The DPRA (Gerberick et al., 2004, 2007) measures the depletion of two model peptides containing a cysteine or lysine residue as a reactive nucleophilic centre after incubation with a test substance. If the mean cysteine- and lysine-peptide depletion, compared to depletion in the reference control, is above 6.38%, the test result is positive and the substance is considered to be peptide reactive. The DPRA was validated by the European Centre for the Validation of Alternative Methods (ECVAM) (EURL ECVAM Scientific Advisory Committee, 2016), and OECD Test Guideline 442C has been adopted.
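The prediction model above can be sketched in a few lines (a minimal illustration; the function name is ours, and only the 6.38% mean-depletion threshold is taken from the text):

```python
def dpra_classify(cys_depletion_pct, lys_depletion_pct, threshold=6.38):
    """Classify a substance with the DPRA prediction model described above:
    positive ("peptide reactive") if the mean cysteine/lysine peptide
    depletion, relative to the reference control, exceeds 6.38%."""
    mean_depletion = (cys_depletion_pct + lys_depletion_pct) / 2.0
    return "positive" if mean_depletion > threshold else "negative"
```

For example, a substance depleting the cysteine peptide by 10% and the lysine peptide by 8% (mean 9%) would be called positive.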

The ARE-Nrf2 luciferase method LuSens
The ARE-Nrf2 luciferase test method, covered by KeratinoSens™ (Natsch et al., 2011) and LuSens (Ramirez et al., 2014, 2016), is described in Test Guideline No. 442D (OECD, 2015a, 2018a) and addresses the second key event of the skin sensitisation AOP, namely keratinocyte activation, by assessing, with the help of luciferase, the Nrf2-mediated activation of antioxidant response element (ARE)-dependent genes. Skin sensitisers have been reported to induce genes that are regulated by the ARE. Small electrophilic substances such as skin sensitisers can act on the sensor protein Keap1 (Kelch-like ECH-associated protein 1), e.g. by covalent modification of its cysteine residues, resulting in its dissociation from the transcription factor Nrf2 (nuclear factor-erythroid 2-related factor 2). The LuSens test method makes use of an immortalised adherent cell line derived from human keratinocytes that stably harbours a luciferase reporter gene under the control of the ARE. The luciferase signal reflects the activation of endogenous Nrf2-dependent genes by sensitisers (OECD, 2018a). For determining the keratinocyte-activating potential, LuSens (Ramirez et al., 2014, 2016) measures the luciferase induction after treatment with a test substance. If a statistically significant fold induction (FI) of luciferase activity above 1.5 is observed at relative cell viabilities of at least 70%, the test result is positive and the substance is considered to have a keratinocyte-activating potential.

The human Cell Line Activation Test (h-CLAT)
The h-CLAT method (Ashikaga et al., 2006, 2010; Sakaguchi et al., 2006, 2010) has recently been validated by ECVAM and is described in OECD Test Guideline No. 442E (OECD, 2018b). It addresses the third key event of the skin sensitisation AOP, namely dendritic cell (DC) activation. The induction of the expression of the cell surface markers CD54 and CD86 is measured after treatment with a test substance, relative to concurrent vehicle controls, in immortalised human monocytic leukemia THP-1 cells as a surrogate of DCs. These surface molecules are typical markers of monocytic THP-1 activation and may mimic DC activation, which plays a critical role in T-cell priming. The changes of surface marker expression are measured by flow cytometry following cell staining with fluorochrome-tagged antibodies (OECD, 2018b). If at least a two-fold induction of CD54 expression and/or a 1.5-fold induction of CD86 expression is observed at relative cell viabilities of at least 50%, the test result is positive and the substance is considered to have a dendritic cell-activating potential.
The "2 out of 3" ITS

The "2 out of 3" ITS (Bauch et al., 2012; Urbisch et al., 2015) combines test results from the DPRA, the ARE-Nrf2 luciferase method (covered by either LuSens or KeratinoSens™) and the h-CLAT, and it constitutes the first case study in the recent OECD report describing Defined Approaches to be used within Integrated Approaches to Testing and Assessment (IATA) for skin sensitisation (OECD, 2016b). Equal weights are attached to each of the non-animal testing methods, which capture the three key events of the skin sensitisation AOP. The overall classification of a substance is determined by the majority of concordant test results from the DPRA, LuSens or KeratinoSens™, and the h-CLAT, respectively.
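Under equal weighting, the ITS call reduces to a simple majority vote over the three binary outcomes (a minimal sketch; the function name and string labels are ours):

```python
def its_2_of_3(dpra, are_nrf2, hclat):
    """Majority-vote prediction model of the "2 out of 3" ITS: each argument
    is the binary outcome ("positive"/"negative") of one non-animal method,
    all weighted equally; at least two concordant results decide the call."""
    votes = [dpra, are_nrf2, hclat]
    return "positive" if votes.count("positive") >= 2 else "negative"
```

For example, a positive DPRA and h-CLAT result overrules a negative LuSens result.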

2.2
Methodological approach for quantifying the borderline range around a testing method's classification threshold

For the classification of test substances' sensitisation potential, both animal and non-animal testing methods apply prediction models using defined threshold values that dichotomise continuous experimental results into binary, i.e. positive and negative, outcomes (Hoffmann and Hartung, 2005; Van der Schouw et al., 1995; Leontaridou et al., 2017). Comparing experimental results obtained with a non-animal testing method with animal or human data allows quantifying the fractions of substances revealing true positive (TP), true negative (TN), false positive (FP), or false negative (FN) results (Krzanowski, 2009). Following this, a non-animal testing method's predictive accuracy metrics, e.g. sensitivity, specificity, and concordance, can be determined. Predictive accuracy metrics specify a non-animal testing method's ability to correctly classify an unknown substance compared to the reference animal test (e.g. the LLNA; OECD, 2010).
Following Leontaridou et al. (2017), the BR of a testing method denotes the symmetric range of one pooled standard deviation ($SD_p$) on both sides of the testing method's classification threshold $T$, pooled across the substances and concentrations used:

$$BR = [T - SD_p,\; T + SD_p] \quad (1)$$

For a given testing method, the $SD_p$ of experimental results retrieved from testing different substances and concentrations is defined as

$$SD_p = \sqrt{\frac{\sum_i \sum_c (n_{i,c} - 1)\,\sigma_{i,c}^2}{\sum_i \sum_c (n_{i,c} - 1)}} \quad (2)$$

where $\sigma_{i,c}^2$ is the variance of results for substance $i$ and concentration $c$, and $n_{i,c}$ denotes the number of replicates per substance $i$ and concentration $c$. The variance of experimental results is given by

$$\sigma_{i,c}^2 = \frac{1}{n_{i,c} - 1} \sum_{l=1}^{n_{i,c}} \left(x_{i,c,l} - \bar{x}_{i,c}\right)^2 \quad (3)$$

which acknowledges that for a certain concentration different replicates can be generated. Here $x_{i,c,l}$ denotes a particular test result of substance $i$, concentration $c$ and replicate $l$, and $\bar{x}_{i,c}$ is the arithmetic mean of test results for substance $i$ and concentration $c$.
Using experimental results for the DPRA, LuSens, the h-CLAT and the "2 out of 3" ITS published in Bauch et al. (2012) and Urbisch et al. (2015), Leontaridou et al. (2017) identified the substances yielding test results within the BR of these methods and of the "2 out of 3" ITS. Table 1 presents the parameter values for calculating the BR and the number of substances yielding test results falling in the BR.
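The pooled standard deviation and the borderline range defined above can be computed as follows (a minimal sketch; the function names and the data layout, a mapping from (substance, concentration) pairs to replicate read-outs, are ours):

```python
import math

def pooled_sd(replicates):
    """Pooled standard deviation across substances and concentrations:
    `replicates` maps each (substance, concentration) pair to the list of
    replicate read-outs obtained for it.  Each pair contributes its sample
    variance, weighted by its degrees of freedom (n - 1)."""
    num = 0.0   # sum of (n_ic - 1) * variance_ic
    den = 0     # sum of (n_ic - 1)
    for results in replicates.values():
        n = len(results)
        if n < 2:
            continue  # a single replicate carries no variance information
        mean = sum(results) / n
        var = sum((x - mean) ** 2 for x in results) / (n - 1)
        num += (n - 1) * var
        den += n - 1
    return math.sqrt(num / den)

def borderline_range(threshold, sd_p):
    """Symmetric borderline range of one pooled SD around the threshold."""
    return (threshold - sd_p, threshold + sd_p)
```

For instance, with the DPRA threshold of 6.38% and a hypothetical pooled SD of 1.0, any mean depletion between 5.38% and 7.38% would be flagged as borderline.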
Tab. 1: Range of test results around the classification threshold used to calculate the pooled standard deviation SDp, the borderline range of each testing method and the "2 out of 3" ITS, and the size of the entire sample and of the adapted sample after excluding the number (percentage) of substances identified as borderline

Scenarios for analysing the impact of precision uncertainty on non-animal methods' predictive accuracy
To examine the impact of limited precision, sample size and sample composition on the classification bias of non-animal testing methods, we defined different scenarios, which are summarised in Table 2. First, we determined sensitivity (Eq. 4), specificity (Eq. 5) and concordance (Eq. 6) using experimental results from substance samples tested with the DPRA, LuSens, the h-CLAT, and the "2 out of 3" ITS, based on the LLNA (Bauch et al., 2012; Natsch et al., 2013; Urbisch et al., 2015).
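Written out in terms of the fractions introduced in Section 2.2, sensitivity, specificity and concordance (Eqs. 4-6) take the standard forms:

```latex
Se = \frac{TP}{TP + FN} \quad (4)
\qquad
Sp = \frac{TN}{TN + FP} \quad (5)
\qquad
Con = \frac{TP + TN}{TP + TN + FP + FN} \quad (6)
```

where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative classifications, respectively.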
The composition of experimental samples, and test results compared to LLNA and human reference data, are documented in Tables S1-S4. Accuracy metrics were derived from experimental test results for the entire samples of substances (i.e. including the borderline substances) and for adapted samples (i.e. excluding the borderline substances), following the standard approach discussed in Cooper et al. (1979). The difference in accuracy metrics between the entire and the adapted sample illustrates the impact of precision uncertainty (Scenario 1 in Table 2). Second, accuracy metrics were calculated for randomised samples. Randomisation was achieved by applying non-parametric standard bootstrap resampling analysis (Table 3). This method was used earlier by Worth and Cronin (2001b) to assess the variability of the Draize tissue scores. Our study applies a similar approach but focuses on assessing the combined impact of varying sample composition and limited precision on non-animal methods' accuracy. The bootstrap resampling analysis was conducted in Microsoft Excel. We determined accuracy metrics for the entire sample (i.e. including borderline substances, Scenario 2a in Table 2) and for the adapted sample (i.e. excluding borderline substances after randomisation, Scenario 2b in Table 2). The difference in accuracy metrics between the two cases illustrates the joint impact of sample composition and precision uncertainty on predictive accuracy metrics. For every non-animal method and the "2 out of 3" ITS, a set of y = 10,000 randomised samples (Efron, 1993; Ostaszewski and Rempala, 2000) was created by random replacement of the binary classifications obtained from experimental test results (Table 3, Step 1). The number of substances in randomised samples, denoted k, was equal to the number of substances in the entire and adapted experimental samples (columns 4 and 5 in Table 1). Randomised samples were assumed to be independent and identically distributed (Wehrens et al., 2000). For all randomised samples we determined sensitivity, specificity and concordance (Se*, Sp* and Con*, respectively). This revealed non-parametric distributions of sensitivity, specificity and concordance for the entire samples (i.e. including borderline substances) and for the adapted samples (i.e. excluding borderline substances; see also Table 3, Steps 2 and 3). For every distribution we determined the mean and the standard deviation (SD) according to Eqs. (7) and (8):

$$\bar{a} = \frac{1}{y} \sum_{j=1}^{y} a_j \quad (7)$$

$$SD = \sqrt{\frac{1}{y-1} \sum_{j=1}^{y} \left(a_j - \bar{a}\right)^2} \quad (8)$$

with $a_j$ denoting the accuracy metric determined from randomised sample $j$, and $y$ denoting the number of random samples ($y$ = 10,000).
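The resampling scheme can be sketched as follows, shown here for the concordance metric only (a minimal illustration; the function name, the outcome labels, and the use of Python in place of the study's Excel implementation are ours):

```python
import random

def bootstrap_concordance(outcomes, y=10_000, seed=0):
    """Non-parametric standard bootstrap: `outcomes` lists one classification
    outcome per substance ("TP", "TN", "FP" or "FN").  Each of the y resamples
    draws len(outcomes) substances with replacement and yields one concordance
    value; the mean and SD of the resulting distribution are returned."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    k = len(outcomes)
    values = []
    for _ in range(y):
        sample = [rng.choice(outcomes) for _ in range(k)]
        correct = sample.count("TP") + sample.count("TN")
        values.append(correct / k)
    mean = sum(values) / y
    sd = (sum((v - mean) ** 2 for v in values) / (y - 1)) ** 0.5
    return mean, sd
```

The mean of the bootstrap distribution is expected to sit close to the concordance of the original sample, while the SD quantifies the variability introduced by sample composition.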

Tab. 3: Steps for conducting non-parametric standard bootstrap resampling analysis
Step

Description Step 1
Bootstrap resampling with random replacement of experimental test results from the individual non-animal testing methods and the "2 out of 3" ITS.

Step 2
Quantification of sensitivity, specificity and concordance for the bootstrap sample (a).

Step 3

Repetition of Steps 1 and 2 for y = 10,000 randomised samples, yielding distributions of sensitivity, specificity and concordance.

Step 4
Calculation of the mean, the standard deviation and the 95% confidence interval of the distributions obtained for sensitivity, specificity and concordance.
In addition, we calculated confidence limits using the simple percentile method. Specifically, the 95% confidence interval (95% CI) was determined by the values corresponding to the 2.5% and 97.5% percentiles in the bootstrap distributions of sensitivity, specificity and concordance, respectively (Table 3, Step 4). Third, we assessed accuracy metrics for randomised sub-samples of varying sizes in order to analyse the joint impact of uncertainty in sample size and composition, and of precision uncertainty, on non-animal testing methods' accuracy (Scenarios 3a and 3b in Table 2). Following the procedure for bootstrap resampling outlined in Table 3, we calculated the mean, the SD and the 95% CI of the predictive accuracy metrics for each sub-sample, including and excluding borderline substances. Sub-samples consisted of k = 10, 50, 100 and 150 substances for the DPRA (with random replacement from the experimental sample consisting of 199 substances), k = 10, 20, 40 and 60 substances for LuSens (with random replacement from the experimental sample consisting of 79 substances), and k = 10 and 20 substances for the h-CLAT and the "2 out of 3" ITS (with random replacement from the experimental samples consisting of 40 substances), respectively.
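The simple percentile method can be sketched as follows (a minimal illustration; the nearest-rank percentile convention used here is our choice, not necessarily the one used in the study):

```python
import math

def percentile_ci(values, lower=2.5, upper=97.5):
    """95% CI by the simple percentile method: read off the 2.5th and 97.5th
    percentiles of the bootstrap distribution of an accuracy metric."""
    ordered = sorted(values)
    n = len(ordered)

    def at(p):
        # nearest-rank percentile: 1-based rank ceil(p/100 * n)
        rank = max(1, math.ceil(p / 100.0 * n))
        return ordered[rank - 1]

    return at(lower), at(upper)
```

Applied to the 10,000 bootstrap values of, say, sensitivity, this returns the pair of values bounding the central 95% of the distribution.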

Results
This section presents the results from analysing the impact of borderline substances on accuracy metrics of the DPRA, LuSens, the h-CLAT and the "2 out of 3" ITS, using the LLNA as reference test. Tables S5-S7 show results obtained from experimental samples compared to human data.

Impact of precision uncertainty
Excluding test results within the BR increased the sensitivity, specificity and concordance of the DPRA (with 20 out of 199 substances yielding results in the BR), but had little effect on the predictive accuracy metrics of LuSens (with 5 out of 79 substances with results in the BR) (Tab. 4). Sensitivity and concordance of the h-CLAT decreased (with 8 out of 40 substances with results in the BR) after excluding borderline substances. Likewise, excluding results falling in the BR for the combination of the three methods in the "2 out of 3" ITS decreased sensitivity and concordance (86% to 82%, with 4 out of 40 substances in the BR). Note that the "2 out of 3" ITS reveals fewer borderline substances than the h-CLAT because the final classification is based on results from three methods. Hence, LuSens and DPRA results that are not in the BR may be sufficient for concluding a "2 out of 3" ITS result to be outside the BR even if the result obtained from the h-CLAT, as a third assay, is within it.

Joint impact of precision uncertainty and varying sample composition
The non-parametric bootstrapping procedure yielded a distribution of accuracy metrics for each non-animal testing method. For the DPRA and LuSens we found predictive accuracy metrics to be normally distributed, while for the h-CLAT and the "2 out of 3" ITS approach we observed a left-skewed distribution of accuracy metrics.
Randomisation causes the sample composition to differ.Consequently, the number of borderline substances differs as well.The minimum and maximum number of substances in the adapted samples (i.e. after borderline substances were excluded) is shown in Table 5.

3.4
Joint impact of varying sample composition and sample size in combination with precision uncertainty

Finally, the parameters revealed for the distributions of sensitivity, specificity and concordance from adapted samples (i.e. after excluding borderline substances; see Scenario 3b in Table 2) offer insight into the joint impact of sample size variation, sample composition, and precision uncertainty on predictive accuracy metrics (Table 8). Note that, due to the randomisation of experimental samples, the number of borderline substances within sub-samples could differ. Thus, similarly to Table 5, we can determine the minimum and maximum number of substances for all sub-samples after excluding borderline substances, which is shown in column 1 of Table 8. Distribution parameters of accuracy metrics (columns 2-4) capture the range of sample sizes per sub-sample.

Impact of uncertainty due to limited precision
Acknowledging that borderline substances cannot be classified as "positive" or "negative", we expected that predictive accuracy metrics derived from the entire samples (i.e. including borderline substances) would differ from those of the adapted samples (i.e. excluding borderline substances). Indeed, our results show that accounting for precision uncertainty changed the accuracy metrics of the DPRA, LuSens, the h-CLAT, and the "2 out of 3" ITS approach (Table 4). However, this impact was not symmetric across the individual non-animal testing methods and the "2 out of 3" ITS. In particular, whereas for the DPRA sensitivity, specificity and concordance increased after the exclusion of borderline substances from experimental samples, for the h-CLAT sensitivity and concordance derived from test results compared to the LLNA decreased considerably, and remained almost unchanged when derived from test results compared to human data (see Tab. S7). For LuSens we observed a small increase of sensitivity, but no change and even a slight decrease of specificity and concordance values, respectively. For the "2 out of 3" ITS approach we found an increase of sensitivity, specificity and concordance when assessed with substances compared to the LLNA. Similar results were observed for the accuracy metrics when test results of the non-animal testing methods were compared to human data (see Tab. S7).
For the individual non-animal testing methods, the size and direction of the impact on accuracy metrics also depends on the composition of experimental samples and on whether the test results of borderline substances are above or below the classification threshold. If, as in the case of the DPRA, more borderline substances revealed results below the classification threshold (thus they would be classified as 'negative' when ignoring the BR), excluding these substances increases the fraction of substances classified as 'positive', which in turn causes an increase of sensitivity. In contrast, the test results of substances providing results in the BR of the h-CLAT were all above the classification threshold (Leontaridou et al., 2017). Excluding these substances in order to correct for ambiguous classifications decreases sensitivity for the experimental sample with the LLNA as reference test (see Tables 4 and 6). In addition, since accounting for the BR changed the fractions of TP and FN classifications of the substances remaining in the sample, we observed a slight decrease of concordance. In the case of LuSens, only few substances were identified as borderline in the experimental sample (4 out of 79 substances, i.e. 5%; Leontaridou et al., 2017). Due to the stringent acceptance criteria and the prediction model of LuSens (Ramirez et al., 2014, 2016), excluding these substances had only marginal impact on predictive accuracy metrics. Finally, we found a slight decrease of sensitivity and concordance, and no change of specificity, of the "2 out of 3" ITS. This can be explained by considering that the prediction model of the "2 out of 3" ITS bases the overall conclusion about the skin sensitisation potential of a substance on at least two concordant test results from the DPRA, LuSens or the h-CLAT, and assigns equal weights to each testing method. This balances the impact of precision uncertainty on predictive accuracy metrics observed for the individual non-animal testing methods.
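The direction of this bias can be illustrated with a small numerical sketch (the counts below are hypothetical, chosen only to mirror the DPRA case, where most borderline results lie below the threshold):

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical counts (not from the paper): 40 in vivo sensitisers, of which
# 30 test positive (TP) and 10 test negative (FN); suppose 5 of the FNs are
# borderline, i.e. their read-outs lie just below the classification threshold.
se_entire = sensitivity(30, 10)       # BR ignored: borderline FNs included
se_adapted = sensitivity(30, 10 - 5)  # borderline substances excluded
# se_adapted exceeds se_entire: excluding below-threshold borderline results
# raises sensitivity, as observed for the DPRA.
```

The mirror image holds for the h-CLAT case: if all borderline results lie above the threshold, excluding them removes TPs and lowers sensitivity.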

4.2
Impact of precision uncertainty and varying sample composition

Determining accuracy metrics from randomised samples allows specifying their variation depending on the sample composition (Scenario 2a in Table 2). For all individual non-animal testing methods and the "2 out of 3" ITS, the mean of the distributions of accuracy metrics corresponded to the values observed for the deterministic experimental samples (Table 6).
Our results illustrate that accounting for precision uncertainty in combination with randomised sampling did not affect the mean of the accuracy metrics. In the case of the h-CLAT and the "2 out of 3" ITS, we observe a slight increase of the SD of the accuracy metrics, and thus a small increase of the overall uncertainty of these metrics. Furthermore, we found the 95% confidence interval to shift slightly for most accuracy metrics.

4.3
Impact of uncertainty due to sample size

Our results illustrate that an increasing sample size (Scenario 2b in Table 2) decreased the variability of predictive accuracy metrics as expressed by the SD (Table 7). This was found for all individual non-animal methods and the "2 out of 3" ITS. More specifically, for the individual methods and the "2 out of 3" ITS, the SD of accuracy metrics from randomised sub-samples was up to four times higher than the SD obtained from randomised full samples. Note that both randomised full and sub-samples still include borderline substances. Furthermore, the 95% CI of predictive accuracy metrics was considerably larger, implying that for very small sample sizes a robust estimate of predictive accuracy metrics cannot be provided. Similar results were found for the adapted samples (i.e. excluding borderline substances; see Table 8). Finally, for the DPRA we observed that the SD and the 95% CI values decreased for sample sizes up to k = 100 substances, but remained fairly stable at sample sizes of k ≥ 100 substances, irrespective of whether borderline substances were included in or excluded from the samples. Our findings suggest, therefore, that uncertainty due to varying sample size (and composition) is large and highly dependent on the actual sample size (and composition) when derived from samples containing k ≤ 50 substances (for the DPRA; the uncertainty was somewhat smaller for LuSens and the h-CLAT). A corresponding analysis of animal methods is a next step to allow for comparisons of biases in predictive accuracy between animal and non-animal testing methods.

Tab. 4: Impact of precision uncertainty on predictive accuracy metrics of non-animal testing methods and the "2 out of 3" ITS

Tab. 6: Mean, standard deviation (SD) and 95% confidence intervals of accuracy metrics of the DPRA, LuSens, the h-CLAT and the "2 out of 3" ITS determined from randomised samples

Range of test results around T for calculating the pooled standard deviation SDp*; SDp and BR from Leontaridou et al. (2017). BR: borderline range; SDp: pooled standard deviation; FI: fold induction; CD54 FI, CD86 FI: fold induction of cell surface marker expression; MPD: mean peptide depletion. * The chosen range is the smallest of a set of ranges tested in Leontaridou et al. (2017). Source: Adapted from Leontaridou et al. (2017).