Evidence-Based Toxicology – the Toolbox of Validation for the 21st Century?



Summary
Validation has become a primary driver of the evolution of toxicological methods. Agreement at OECD level currently requires validation of new approaches for consideration in test guideline development. Several examples of this exist. However, the toxicology in the 21st century movement, prompted by the 2007 NRC/NAS vision document, might lead to a revolutionary change in the toxicological toolbox. The challenge is whether the validation process, as it has been formalized over the last two decades, meets the needs for this paradigm shift. The concept of evidence-based medicine (EBM) has emerged from clinical medicine, which retrospectively assesses the evidence of adequacy of a given approach. This is not typically done in prospective studies; the equivalent of validation studies might be multicenter randomized trials. Evidently, where such unambiguous evidence is available, no other assessment is necessary. EBM, however, has developed procedures, including meta-analysis, to collect and evaluate all the available evidence where no such definitive study is available. The recent successful introduction of retrospective validation, i.e. the collection and evaluation of existing evidence from various sources, represents a step in this direction. Here, we will explore new toxicological approaches via evidence-based toxicology (EBT).

Introduction
Toxicity testing is at a critical crossroad, prompted three years ago by the publication of the NRC report on toxicology in the 21st Century (Tox-21c) (NRC, 2007) and the discussion it triggered (Collins et al., 2008; Gibb, 2008; Andersen and Krewski, 2009; Kavlock, 2009; Kavlock et al., 2009). The report addressed the adequacy of the current tools with regard to throughput, costs, predictability of human responses, and animal use. Similar thoughts were followed in the 2004 FDA Critical Path Initiative (Coons, 2009). Obviously, the current methods for hazard assessment are largely the same for industrial chemicals (Hartung, 2010b), nanoparticles (Hartung, 2010c), pesticides, and drug candidates. The assessment of toxicological risk relies primarily on in vivo animal experiments that were designed decades ago and cost about $3 billion/year worldwide (Bottini and Hartung, 2009). Their low throughput has led to a backlog of substances whose potential toxicity remains to be adequately assessed (Grandjean and Landrigan, 2006; Judson et al., 2009) and hinders front loading of toxicity testing in the drug development process. Species differences in toxicity responses require the use of uncertainty factors for human risk assessment (NRC, 2000). Moreover, to minimize the risk of not identifying a human hazard, animal studies are usually performed at doses orders of magnitude higher than realistic human exposure (Hartung, 2009a). In the field of drug development about 92% of substances fail during clinical trials, about 20% of which are due to toxic effects in humans not identified in pre-clinical animal testing (FDA, 2004). Thus, there are serious concerns regarding the efficiency and relevance of current toxicity testing methods for human health effects.

Validation - a blessing or a curse for a paradigm shift in toxicology?
Most test methods in regulatory toxicology have been established by fixing acute demands with reasonably available tools identified by consensus processes. Though scientific truth is established on the basis of irrefutable evidence, majority of opinion governs regulatory toxicology. The process is driven by historic requirements, contemporary scientific knowledge, individuals who happen to be in charge, political decisions, etc. and is thus circumstantial rather than strategic. No objective criteria other than reproducibility, some reference toxicants, and cost/effort considerations are available, creating what might be termed a status of "face validity" or apparent validity. Actual relevance is typically established by "learning from experience." Heeding the warning of Petr Skrabanek and James McCormick that "Learning from experience might be nothing more than making the same mistakes with increasing confidence" (Skrabanek and McCormick, 1989), we might ask ourselves which mechanisms there are to challenge the validity of these methods.
The situation is different when consensus has been established for particular tools. Such tools offer a point of reference to validate new approaches. Validation is defined (Balls et al., 1990; Hartung et al., 2004; OECD, 2005) as the independent assessment of a method for a defined purpose as to its reproducibility, scientific basis, and reliability/relevance. A culture of prospective ring trials has developed which is capable of reassuring the one-to-one replacement of a method by a better one, e.g., one limiting or refining animal use.
Tremendous problems arise (Balls et al., 2006; Hartung, 2007a) where no reference method exists, the reference method is flawed, or the purpose or applicability of both methods is overlapping but not identical. Unfortunately, most areas of toxicology are a mixture of those problems. This is why very few replacement methods have been accepted, and, when they are, they often outperform the reference method and demonstrate its flaws. One obvious solution would be the joint comparison with an independent point of reference, such as human data, but this is rarely available and even more rarely done. The typical validation is done against the reference method only.
The concept of a reference method is often perverted to the concept of a "gold standard": this is most appropriate for reference substances and the definition of units of measure (where the concept originates). However, scientific methods arise from a competition of ideas and the point of reference must evolve. The term "traditional methods" is more adequate. However, we should be clear that this implies that we cannot get better than this reference method. The only alternative is to anchor our validation with other reference points such as clinical data, composite knowledge on certain substances, or understood mechanisms. We have suggested earlier (Hartung, 2007a; Hoffmann et al., 2008; Ahr et al., 2008) that such points of reference are chosen, but this will require expert judgment and consensus processes (for example see Vinken et al., 2008).
The concept of formal validation and its expanding definition in detail has taught a lot about method evaluation: there is no doubt that quality assurance of methods must be based on method definition (including purpose and applicability domain) and reproducibility. This is actually also the easy part; it gets difficult when scientific basis and relevance are addressed. Traditional validation has not really embraced scientific relevance, and the relevance assessments suffer from the limitations of their point of comparison.
Furthermore, formal validation has become a lengthy and costly exercise. Idealized time estimates are three years for validation plus two for peer review, and costs of $500,000 involving 3 or more laboratories. For a new battery of tests or a major revision of the toxicological toolbox the feasibility is quickly questionable. Similar to the problem of mixtures in toxicology, the problem of mixing methods in integrated testing strategies quickly exposes its limits, as far too many combinations are possible. So either we limit ourselves to the rigidity of fixed combinations or we need to find faster and more flexible approaches. The feasibility of a validation study must consider the required study size and duration relative to the speed of technological progress.

The needs of Tox-21c in validation
Tox-21c proposed a paradigm shift in toxicology. Instead of relying on traditional animal experiments, the report proposes the application of the latest advances in science and technology to develop more relevant test strategies. It has some similarities with the alternative methods movement, but it is less animal welfare-driven and prompted more by limitations in throughput and predictivity (Hartung, 2008b; Hartung, 2010a). It also does not rely on a one-by-one replacement of methods with better (3Rs) ones. Instead, pathways of toxicity (PoT) shall be identified using in vitro cell systems (preferably human), high throughput testing, 'omics' approaches, systems biology, and computational modeling. A new battery of tests covering PoT will then form the basis for a new approach, which promises a truly revolutionary change (Hartung, 2008a). The so-called "pathways of toxicity" are defined as changes in normal biological processes, e.g. cell function, communication, and adaptation to environmental changes, which are expected to result in adverse health effects (NRC, 2007). Toxins can affect multiple PoT in different cell systems, leading to various adverse effects (cell phenotypes) (Fig. 1). One familiar PoT described in the Tox-21c report is the signaling by estrogens, in which initial exposure results in enhanced cell proliferation. The identification of these pathways would likely provide more accurate information for chemical or drug risk to humans than current animal tests (Gibb, 2008). Although a broad discussion has ensued on the design and feasibility of this new concept for toxicology, in practice we are still at the very beginning of the paradigm shift. This relates to the development of new technologies as well as to their quality assurance, acceptance, and integration (Hartung, 2009c). Tox-21c emphasizes the concept of PoT, i.e. a mechanistic foundation and a reconstruction of safety assessments acknowledging the limitations of traditional approaches. It is obvious that this will require a role of the assessment of the scientific basis and relevance different to traditional validation. The question to be addressed here is whether EBT lends itself to this problem.
Tox-21c is envisaged as a toxicity testing approach, using primarily cellular and in silico methods as well as lower organisms and:
- Is based on identified and annotated PoT, probably requiring a type of Human Toxicology Project (Seidle and Stephens, 2009) for their comprehensive mapping
- Is based on information-rich technologies (Hartung and Leist, 2008; Leist et al., 2008b) such as omics, image analysis and high-throughput testing, at least to identify the PoT or signatures of toxic effects but likely also for the actual testing; expectations are high that systems biology approaches might be translated to form a systems toxicology, which means the complex integration of several omics by bioinformatics
- Combines different methods in integrated testing strategies (ITS)
- Is combined with PBPK modeling
- Is predictive of human health risks (and environmental hazards; it should be noted that the concept of Tox-21c has not been largely discussed for this area)
Obviously, the new approach finally must meet similar standards as to quality assurance (validation including GLP and other good practices) and international harmonization (Bottini et al., 2008; Hartung, 2009c) as the current schemes. Thus we will need to explore what this means for validation.
It is assumed that several hundred PoT exist; many thousands are unlikely. Figure 2 shows how the identification of more and more PoT by expanding the number of toxicants and the number of cell systems should come to a saturation, i.e. indicating that the mapping is comprehensive. However, the number of variants of PoT is a possible threat to a PoT-based toxicity testing with regard to validation needs. How can we possibly validate these? Actually, the question starts one step earlier, i.e. how can we identify and annotate them? In the end, we likely need a project similar to a Human Toxicology Project to map PoT. In contrast to the currently used phenomenological "black box" animal test, PoT need to be identified in human in vitro systems to provide more relevant, accurate and mechanistic information for the assessment of human toxicological risk. The future ultimate goal would be to map the entirety of human PoT, the human toxome. The concentration at which a substance triggers a PoT might be extrapolated to a relevant human blood or tissue concentration and finally a corresponding dose by (retro-) PBPK modeling (Yang et al., 1998; Barton et al., 2007; Bouvier d'Yvoire et al., 2007; Chiu et al., 2007; Clewell and Clewell, 2008), informing human risk assessment. Retro-QSAR indicates that instead of current use, i.e. to describe the resulting tissue levels and kinetics for a given dose, the theoretical dose necessary to achieve this tissue level needs to be modeled. This type of new tool will also require validation, and it remains to be explored whether the principles of validation as adapted to (Q)SAR are applicable (Hartung and Hoffmann, 2009).
However, probably more important, if a substance does not trigger any of these PoT, for the first time it may be possible to establish the non-toxicity of a substance at a given concentration, which might be combined later with thresholds of toxicological concern (TTC) (Kroes et al., 2005). The major advance is that external doses need not be used for TTC, but threshold tissue levels can be used, bringing the concept one step closer to the effect. We might be able to establish the no-effect levels in tissues with the PoT-based methods and then estimate by retro-QSAR which dose would be required to achieve such levels.
It is assumed that Tox-21c will also integrate further in silico approaches (Hartung and Hoffmann, 2009). However, the Tox-21c literature is relatively vague about Integrated Testing Strategies (ITS). Practical considerations, in contrast, have led to suggestions of tiered testing strategies and, at least, the vision of integrated or "intelligent" testing (van Leeuwen et al., 2007). The most advanced are found in the REACH test guidance (European Chemicals Agency, 2008). These represent special challenges to their validation (Kinsner-Ovaskainen et al., 2009); a working definition of ITS has been proposed in this EPAA/ECVAM workshop: "In the context of safety assessment, an Integrated Testing Strategy is a methodology which integrates information for toxicological evaluation from more than one source, thus facilitating decision-making. This should be achieved whilst taking into consideration the principles of the Three Rs (reduction, refinement and replacement)." This definition is very much imprinted by the largely European alternative method concept (Hartung, 2010a) and has little specificity. The workshop defined components of ITS as building blocks if they are "single test or a battery of tests with associated prediction model(s)/data interpretation procedures (DIPs), and ending with a decision step." The definition is in so far confusing as the prediction models and DIPs have so far been interpreted as an algorithm to estimate the in vivo result, which means something only the overall ITS can deliver. It is further stated that components of ITS require assessment of reproducibility and transferability, while "relevance (predictive capacity) belongs to the entire test strategy." Arguably, the authors state that such validation is only required for ITS for "Replacement of Test Guideline used for regulatory purposes" while not for screening, hazard classification and labeling, and risk assessment. It is obvious that Tox-21c will have to address these questions.
Before constructing an ITS based on PoT, it will be necessary to identify a substantial number of PoT, ideally the entire human toxome. To achieve the mapping of the human toxome, it will be necessary to combine several of the latest emerging technologies in life sciences. Human embryonic stem cells represent one of the most promising human cell systems (Davila et al., 2004; Bremer and Hartung, 2004; Pellizzer et al., 2005; Adler et al., 2008; Leist et al., 2008a; Stummann and Bremer, 2008; Chapin and Stedman, 2009). Key advantages over other cell systems (Hartung, 2007b) are, first, that they can generate an unlimited source of differentiated cells, which allows the testing of substances at a broad range of concentrations in a time- and cost-efficient manner. Secondly, they are pluripotent and can be differentiated into any tissue of the human body. However, standardized protocols are only emerging and it is not clear whether quality controls of Good Cell Culture Practice (Coecke et al., 2005) are sufficient for this new field. The use of a human cell system will avoid species differences in toxicity responses. In principle, however, any cellular system should be usable, but either human primary cells or organotypic cultures would be recommended.
The toxicity responses of the chosen cellular system towards reference substances will need to be studied especially by metabolic and genetic profiling methods. Metabolic profiling can be achieved with a mass spectrometry-based metabolomics approach, which has advantages over NMR-based metabolomics approaches in terms of sensitivity, number of detectable metabolites, and metabolite quantification. Since the metabolome is defined by gene, transcript, and protein changes, it is the omics science closest to the phenotype. The measured metabolic perturbations represent the final outcome of all physiological processes in the biological system, making the approach most interesting for toxicology (Robertson, 2005). Until now, metabolomics has been mainly applied for in vivo toxicity studies based on metabolic profiling of non-invasive blood and urine samples. Previous work (van Vliet et al., 2008) has already demonstrated the value of metabolomics for in vitro toxicity studies.
In addition to metabolomic analysis, gene array and real-time PCR transcriptome or proteomic profiling will likely be applied. Gene array technologies allow measuring the expression of thousands of genes at the same time, including the entire human genome. In contrast, real-time PCR provides a more specific and accurate tool to further confirm the expression level of a particular gene. Transcriptomics provides a simple and sensitive endpoint to study the toxic responses in biological systems; in fact, gene expression changes are normally observed before any other changes occur. In addition, gene expression analysis in primary cultures is a promising tool to detect chemicals with toxic potential (Hogberg et al., 2009, 2010). In an ECVAM/ICCVAM/NICEATM workshop (Corvi et al., 2006) we have pioneered the discussion of validation of toxicogenomics tools. In this area, in contrast to other omics technologies, at least standards for reporting studies exist (Brazma et al., 2001), but other omics such as metabolomics are following (Griffin et al., 2007; Sansone et al., 2007; van der Werf et al., 2007). Though it is not said that the PoT-based methods or method batteries will necessarily be omics technologies (omics might be used to identify PoT only and then reporter gene assays etc. are constructed to cover the PoT), it demonstrates the complexity of the problem to standardize and QA such methods. Running this under GLP will be extremely difficult, starting with the problem of GLP for cell culture methods in general. However, a major opportunity lies in the commercialization of new in vitro methods, as they lead to standardization and broad availability as kits and integrated machines. The ensuing dialogue of the in vitro/alternatives field with technology providers is most important. The recent concentration process forming a few very large technology providers (Thermo Fisher, Life Technologies, Agilent, etc.) promises long-term availability of high-quality tools to base regulatory testing on. The chances for contract research organizations (CRO) are enormous (Goldberg and Hartung, 2008) and again this quality-assures the methods for regulatory use.
Combining these omic approaches can establish a method for the identification of human PoT. The metabolic and genetic profiling data obtained need to be analyzed using the latest statistical data mining methodologies. First, patterns in the datasets will be recognized, followed by the establishment of relationships between genes and metabolites. With the integration of existing biochemical knowledge, the biochemical pathways altered by substances to PoT can be identified in a systems biology approach. The validation of these PoT will among others rely on the use of gene silencing or knock-out technologies.
A key goal needs to be the definition and annotation of PoT and making them available in an open access database, which allows integration of PoT identified from other groups and systems. Pathway identification from omics approaches is approached in several areas including toxicology (Fischer, 2005; Cho et al., 2006; Bosl, 2007; Kwoh and Ng, 2007). CAAT is working with EPA ToxCast (Judson et al., 2010) toward this key effort. An important element will therefore be a series of consensus workshops defining the database, its content, its validation, and its governance. CAAT and its Transatlantic Think Tank for Toxicology (t4) (Daneshian et al., 2010) aim to organize these.
In order to confirm PoT, it will be necessary either to inhibit the development of the respective signature by blocking a toxicant's effect or to enhance/reproduce it with an activation of the respective PoT. The former can either be achieved (where available) with specific inhibitors or more generally with siRNA. Almost every cell, including human embryonic stem cells, can nowadays be transfected with siRNA (Martinez di Montemuros and Parise, 2008), most efficiently with lentiviruses (Zaehres et al., 2005), but also with lipofection techniques (Matin et al., 2004; Vallier et al., 2004). The suppression of the respective gene in the candidate PoT can be controlled by quantitative real-time PCR of its mRNA. The goal must be to reproduce the metabolic changes as assessed by metabolomics with a block of a key element of the PoT. This means that the derangement of downstream metabolites in the PoT is assessed and compared to those observed when the PoT is deranged by toxicants. Such experiments can be extended to co-exposure with toxin after transfection to analyze a possible aggravation of the metabolic derangement, especially on the basis of metabolomics assessing the metabolites relevant for the PoT.
Obviously such an approach represents rather a validation of the scientific mechanism or mode of action than of the model. Quality assurance means here mainly to link PoT with effects of well-established toxicants. Only at a later stage can the presence of a PoT in a given test system serve as a validation of the test (battery), e.g. by showing that the very same reference substances interfere via the PoT with the test. This scenario stresses the opportunity lying in validation based on PoT and the concept of scientific relevance for validation. We have earlier suggested this as "mechanistic validation" (Coecke et al., 2007; Hartung, 2007a).
A key need for any type of validation is the definition of a prediction model, i.e. the translation of test results into a prediction of adversity. This is a special challenge for the information-rich methods envisaged for Tox-21c (Boekelheide and Campion, 2010). There is discussion about whether this needs to be done before a validation exercise, which is certainly more convincing. It is most important that such a definition of adversity is not Gaussian, i.e. defining "not normal" by percentiles of the controls: first, values do not typically follow normal distributions, but, more importantly, this sets a prevalence of hazard at the given percentage of significance. In other words, a definition of positive test results by "significant change" is not appropriate. Only a definition of normal (non-toxic) and toxic by relating results to reference compounds with known classification can define adversity.
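The contrast between a percentile-based and a reference-compound-based definition of a positive result can be illustrated with a minimal Python sketch. All readout values below are invented for illustration, the function names are hypothetical, and the use of the Youden index to pick the cutoff is one possible choice made here, not a method taken from the cited literature.

```python
import math
from typing import Sequence


def percentile_cutoff(controls: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the control readouts."""
    ordered = sorted(controls)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def reference_cutoff(nontoxic: Sequence[float], toxic: Sequence[float]) -> float:
    """Cutoff that best separates the two reference classes.

    Assumption for this sketch: "best" means the maximal Youden index
    (sensitivity + specificity - 1); ties keep the lowest cutoff.
    """
    best_cut, best_j = math.inf, -1.0
    for cut in sorted(set(nontoxic) | set(toxic)):
        sensitivity = sum(v > cut for v in toxic) / len(toxic)
        specificity = sum(v <= cut for v in nontoxic) / len(nontoxic)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut


def fraction_positive(values: Sequence[float], cutoff: float) -> float:
    """Share of readouts called positive (strictly above the cutoff)."""
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical control readouts: 100 evenly spread values.
controls = [float(i) for i in range(1, 101)]
p95 = percentile_cutoff(controls, 95)
# By construction a fixed share of the controls is called "positive",
# whatever the biology: the percentile fixes the apparent prevalence.
print("flagged controls:", fraction_positive(controls, p95))

# Hypothetical readouts for reference compounds with known classification.
nontoxic_refs = [0.9, 1.0, 1.1, 1.2, 1.3]
toxic_refs = [1.8, 2.2, 2.5, 3.0]
print("reference-based cutoff:", reference_cutoff(nontoxic_refs, toxic_refs))
```

The percentile rule flags the same share of controls regardless of what the test measures, whereas the reference-based cutoff moves with the separation actually observed between known toxic and non-toxic compounds.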

The concept of evidence-based toxicology
Evidence-based medicine (EBM) is a largely accepted process in medicine, which aims to summarize and make available the best available evidence for a medical question in the most transparent and objective manner (Mayer, 2004). Certainly, many aspects of medicine depend on individual factors, values and choices, which are only partially subject to scientific methods. EBM seeks to clarify those aspects of medical practice that are in principle subject to scientific methods. In order to achieve this, the Cochrane Collaboration, a group of over 27,000 volunteering physicians and scientists in more than 90 countries, carries out systematic reviews of relevant medical questions, especially therapeutic interventions but also diagnostic strategies. Specific methods, especially statistical tools such as those related to meta-analysis, are developed and results made available via the Cochrane Library. Already in 1993, Neugebauer and Holaday applied the principles to animal and in vitro data.
Some similarities to the problems of toxicology, and especially the analogy of toxicology and diagnosis setting (Hoffmann and Hartung, 2005), prompted us and others to call for an evidence-based toxicology, EBT (Guzelian et al., 2005; Hoffmann and Hartung, 2006). These include:
- The mixture of traditional and scientific approaches
- The lack of quality assurance for many methods and the difficulty to retrieve such objective assessments lacking a central depository
- The information flood of scientific articles and the expansion of toxicology (larger programs addressing for example untested old chemicals, more legislations in more economically relevant countries, increasing risk avoidance, new products such as nanoparticles, biologicals, and cell therapies etc., concerns about mixtures, new hazards such as endocrine disruption, respiratory sensitization, immunotoxicity, neurodevelopmental toxicity, etc.)
Obviously, toxicology has a similar problem of information flooding and coexistence of traditional and modern methodologies, as well as various biases (Wandall et al., 2007). It is most difficult to find and summarize the relevant information for any given major question. Rudén (2001a,b) showed the divergence in judgment and limitations of analysis for the example of 29 cancer risk assessments carried out for trichloroethylene: 4 assessments concluded that the substance is carcinogenic, 6 said it is not, and 19 were equivocal. The main reason for this divergence was a selection bias in the materials considered, i.e. an average reference coverage of only 18%, an average citation coverage of most relevant studies of 80%, as well as an interpretation difference of most relevant studies in 27%, and study/data quality assessment not being documented in 65% of the assessments. This indicates a tremendous problem for compiling evidence in toxicology.
-eBM has some key tools to further its purpose, which are little used in toxicology: -Systematic reviews (to be distinguished from narrative reviews, which are the common form in science) -Weighing of evidence by quality scores -Meta-analysis and other statistical tools such as likelihood ratios (an application of the Bayes' theorem especially for diagnostics, here basically the pretest odds multiplied by the likelihood ratio gives the post-test odds), AUC-ROC, i.e., the area under the receiver operating characteristic curve (reflects the relationship between sensitivity and specificity for a given test) or number needed to treat/harm (the effectiveness and safety of an intervention expressed in a clinically meaningful way) -Formalized group decision taking processes, such as the Delphi method or the nominal group technique proach is an important step, but selection bias has not really been addressed and the decision process is very traditional.Systematic reviews of available evidence could serve as a role model (Hartung, 2009b).
In conclusion, retrospective analysis moves validation toward eBM with regard to its tools but leaves it chained to its point of comparison -the traditional method.
6 The similarity of diagnosis setting and toxicology as a bridge between EBM and EBT toxicologists familiar with eBM often have problems translating these approaches to their field because most EBM evaluates therapeutic measures and studies outcome.In toxicology, we have few interventions (restrictions of use), and their outcome can usually not be assessed, as the substance is then simply not used and no experience is gained.However, a growing branch of eBM addresses diagnostic measures (Knotterus and Buntinx, 2009).Intellectually, running a couple of tests and setting a diagnosis or assigning toxicity is not different (Hoffmann and Hartung, 2005).In the end, the process is about detecting or excluding a disorder/hazard by increasing diagnostic certainty as to their presence or absence.We might learn from EBM of diagnostic measures.The first issue is its "discriminative power," typically assessed as a table and deducing sensitivity and specificity.We have several times alluded to the problem of the prevalence of hazards impacting on the usefulness of tests (Hoffmann and Hartung, 2005;Hartung, 2009a), i.e., a sensitivity assessed by studying 25 toxic and 25 nontoxic substances does not reflect the sensitivity of the test in the real world, where only 5% of substances might be toxic.However, we must also ask what the discriminatory power is after cheaper and faster tests have been carried out, for instance, its added value in a test strategy.the value of a cancer bioassay for example will further diminish if we have sorted out mutagens.In testing strategies, each step might make only a rather small contribution; however, to evaluate the importance of such small steps it takes a relatively large study population (number of test substances).It is also of key importance to ask, what is the contribution to the decision making process.Data, which are collected but do not add to decisions (risk management) are, at minimum, a waste of money.
The evaluation of diagnostics may be flawed by several biases, most importantly spectrum bias and selection bias."Spectrum bias" occurs if the diagnostic is assessed in a study population with a different clinical spectrum.translated to toxicology, this might be the case if we validate a method for chemicals but base the validation mainly on drugs because good data are available.Selection bias refers to cases where there is a relation between the test result and the probability of being chosen.this would be the case, for example, if tested in parallel to the traditional test applied to suspicious substances.Observer bias would refer to a situation where (unconsciously) more effort is put into one method than the others or where skill and experience levels differ.
A number of workshops took up this idea (Griesinger et al., 2009), a first tool for quality scoring of toxicological studies was developed (Schneider et al., 2009), and the first chair for EBT was created at Johns Hopkins, notably also the site of the US Cochrane center. The EBT concept, however, is still in its infancy. This article is meant to explore whether the Tox-21c movement makes a case for developing EBT.

Experiences with retrospective validation as a kind of EBM-like validation
Retrospective validation is an EBT-like approach, but remains anchored in the validation paradigm of reference methods. This means it is based on an evaluation of available information, but the comparison is done against a reference method. An ECVAM workshop (Hoffmann et al., 2008) aimed to open the concept of reference methods by suggesting the use of reference results, i.e., not to reproduce the results of a given method, but to establish, by weighing evidence, the best possible results to be obtained for a panel of test compounds. This means that every substance chosen is evaluated with the relevant information available to assign the test result that should be achieved for the substance. In other words, the goal would not be to reproduce the results of one animal test, but to identify correctly the positive and negative substances according to an overall assessment, especially including human experiences.
Retrospective validation was introduced into the modular approach to validation (Hartung et al., 2004). It has been used successfully to validate the micronucleus test (Corvi et al., 2008) and some variants of in vitro eye irritation tests (Hartung et al., 2010). The cell transformation assays currently under peer review notably used a combination of retrospective data collection and prospective studies. What are the differences between a prospective and a retrospective validation? Obviously, a prospective study allows challenging a defined test according to the current understanding of its optimal test protocol, applicability, and test demands. Transfer of the method to all laboratories in the round-robin can be controlled, and a power analysis can establish the number of substances likely required. At least in principle the result is open; this means that different, more powerful statistics can be used than in a post-hoc analysis.
In contrast, a retrospective analysis depends entirely on what is available. There is an obvious danger of selection bias here. Regulatory toxicology represents a special challenge because of the proprietary nature of many data, the lack of incentive for publication (especially of negative data), and the prohibition/avoidance of repeated testing. Documentation of existing data is often heterogeneous and/or incomplete, impairing data analysis. In general, only when relatively large data sets are available from several sources will it make sense to collect and jointly analyze them. Ideally, this should take place as a meta-analysis, but this has not been applied to toxicology (Hartung, 2009b).
The EBM movement has its strength, however, in the rigor and transparency of the decision-making process. Certainly, the organization of available evidence according to the modular approach [...]

The need for probabilistic risk assessment
A key feature of EBM is the deduction of probabilities, odds ratios, and confidence intervals. This reflects a scientific reality in the life sciences, where few cases are black and white, at least without tremendous restrictions of the parameters under which the judgment is valid. In toxicology, we have maintained the presumption of a black and white world for a very long time. Substances are either carcinogenic or not, they are irritant or not, corrosive or not, though we sometimes introduce classes like weak or strong. However, all of these are merely measurements that carry uncertainty in the test system, uncertainty in threshold setting, and uncertainty in the extrapolation to humans as well as to actual usage scenarios. Assigning a probability of a certain hazard appears to be much more adequate than the distinct categories we are using. Even better, these probabilities should come with confidence intervals. The goal of our quality assurance/validation must then be to establish such probabilities, which means a completely different type of prediction model. This is much more similar to the reasoning of pre- and post-test probabilities of a certain diagnosis in EBM.
It might be conceptually much easier for regulatory toxicologists to understand that each test shifts only the probability of hazard and the confidence intervals than to think in misclassifications. However, unlike EBM, where a patient is a person with a certain probability of a diagnosis, we are confronted with dose, which means a substance shows a hazard only at a certain dose. Thus, a dimension of complexity is added to our assessments, which complicates such test evaluations. It will have to be shown whether we can handle this complexity and especially whether enough data can be made available to enable such assessments.
For diagnostic research, a number of different questions need to be addressed (Haynes and You, 2009):
- Phase I questions: Do patients with the target disorder have different test results from normal individuals? Translated to toxicology, this is the typical validation paradigm, testing known toxicants and non-toxicants. We typically derive sensitivity and specificity; we might learn from diagnostic tests that these have confidence intervals, which are not always calculated in toxicology.
- Phase II questions: Are patients with certain test results more likely to have the target disorder than patients with other test results? Translated to toxicology, this means moving to the real test situation, i.e., taking into consideration at least the prevalence of the hazard (and converting to predictive values); it might extend to practical aspects such as selection, spectrum, and observer biases. Thus, it moves the test evaluation from ideal test conditions to the routine situation. We have to admit that this is not typically done in validation in toxicology. There is even some resistance to applying the concept of prevalence.
- Phase III questions: Among patients in whom it is clinically sensible to suspect the target disorder, does the test result distinguish those with and without the target disorder? Translated to toxicology, this relates to testing strategies. "Clinically sensible to suspect" might relate to the presence of alerts and suspicious substances as well as results from other tests; if not everything is tested with the test in question, it is important to understand how this selection impacts on the prevalence of hazard in the tested subgroup. It translates the pre-test likelihood of a hazard into a post-test likelihood, i.e., it shows whether the test moves us ahead in confirming the hazard. Notably, all cases matter here in which either the traditional method or the method under evaluation is "lost, not performed or intermediate"; it is most important that these are not simply excluded from the analysis of accuracy. It is equally important that results cannot be reinterpreted by the examiner (e.g., by post-hoc changing of thresholds) and are not biased because the results of the traditional method are known. A special case is when the selection of substances through suspicions or test results is very effective; then the additional value of carrying out the test in question might become very small, i.e., its discriminatory value has been used up along the way of the testing strategy.
- Phase IV questions: Do patients who undergo the diagnostic test fare better (in their ultimate health outcomes) than similar patients who do not? Ultimate health outcomes in toxicology are difficult to obtain; this might be translated to the avoidance of market withdrawals, for instance.
- Phase V questions: Does the ultimate diagnostic test lead to better health outcomes at acceptable costs? Cost-benefit analyses in toxicology are rare and are difficult if we cannot answer phase IV questions. Still, estimates might be helpful to assess the economics of what we are doing (Bottini and Hartung, 2009).
We might learn a lot from the literature on diagnostic tests as to how to derive the confidence intervals for our sensitivity and specificity calculations, the pre- and post-test likelihoods of a diagnosis (hazard) and, from these, the informative/discriminatory value, etc. (Habbema et al., 2009). Notably, alternative methods usually work by threshold setting, turning the continuous results of the test (e.g., % cytotoxicity) into a dichotomous result (toxic or not). However, the absolute value often has informative value, i.e., a borderline value will usually carry less confidence than an extreme value. For diagnostic test strategies, methods are available for assessing the predictive value of combinations and what an individual test adds to it, especially multiple logistic regression formulations of Bayes' theorem (Habbema et al., 2009); other approaches are tree-building methods and neural networks (Buntinx et al., 2009). Importantly, in the diagnostic field, standards for reporting on diagnostic accuracy studies have also been developed (Bossuyt and Smidt, 2009), which could serve as models for toxicology. Even guidance for the systematic review of such studies is available (Horvath and Pewsner, 2004; Buntinx et al., 2009). These examples demonstrate the availability of a rich literature on the evaluation of diagnostic tests, which can be translated to the evaluation of (new) toxicological methods applying EBT concepts.

Fig. 1: Scheme of how toxins affect various pathways of toxicity (PoT) in diverse cell types leading to different or similar phenotypes.

Fig. 2: As more substances are tested in more cell systems an increasing number of PoT will be identified until the entirety of PoT is mapped.

8 Conclusions
Table 1 summarizes the key characteristics of traditional validation studies, retrospective validations, EBT-based assessments of methods, and EBM of diagnostics. The following hypotheses are put forward:
1. Traditional validation allows only substituting a method with something similar and does not accommodate paradigm shifts due to its comparison to the traditional test. The predictivity of a test approach can thus not be changed.
2. The pressing need to renovate methods for regulatory toxicology calls for information-rich, complex methods. These represent a challenge for quality assurance (such as GLP) and validation. While test definition and reproducibility can be handled similarly, scientific relevance will need to be stressed to compensate for the difficulty of establishing predictive relevance in the absence of a reference test.
3. EBM for diagnostics can serve as a role model for test evaluation in toxicology. It offers the advantages of stronger emphasis on scientific validity, more transparent processes, and more extensive biometrical evaluation. It also appears to be more flexible and possibly faster to accommodate new evidence as well as test variants, especially compared to prospective validation studies.
4. We need to develop meta-analysis tools for retrospective validation and EBT. They will be most useful for the risk assessment process for individual compounds, where several studies are available. For this purpose, the quality score concept for toxicological studies needs to be further developed.
5. The concept of pathways of toxicity (PoT) needs to mature (definition, identification, annotation) to develop validation strategies.
6. Integrated Testing Strategies (ITS) pose an unresolved problem for validation; Tox-21c will need to rely to a large extent on ITS. The key problems are how to avoid the rigidity of fixed combinations of methods, how to handle the multiple testing fallacy, and the high number of reference compounds needed for evaluation of decision trees.
7. The discriminatory value of a test changes in an ITS by changing pre-test to post-test probability of hazard. Our validations need to accommodate this by including prevalence and moving to predictive values.
8. Tox-21c relies first on PoT, their combinations, and the signatures they leave in information-rich systems. A key problem is the definition of adversity and its interplay with threshold setting, and thus the outcome of validation. Tox-21c relies secondly on retro-PBPK, for which no quality assurance and validation experience exists. The validation of the combined results represents the next level of complexity.
9. Since a key goal of regulatory toxicology is the identification of non-toxic substances or non-toxic doses of toxicants, Tox-21c requires a comprehensive mapping of relevant PoT. It will be critical to identify areas with likely limited numbers of PoT, such as endocrine disruption, in which to test the concept. For complex hazards, however, validation for identification of non-toxic substances will only be possible after a type of Human Toxicology Project mapping the human toxome.
10. The possibility to expand Tox-21c to ecotoxicology needs to be explored. The conservation of PoT across species is critical here.
11. When moving to probabilistic risk assessment, new measures of performance such as post-test hazard probability need to be developed, which operate with probability of hazard and its confidence intervals. This is further complicated as this is a dose-dependent estimate.
The challenge to Tox-21c will be to steer toward quality control without the creation of obstacles by formal validation. A balance between precaution and innovation is necessary, and this requires informed decisions by the actors in the regulatory arena. EBM has shown how the informed decision process in clinical medicine can be served. EBT promises to be its translation for an informed decision process in risk assessment.
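Hypothesis 7 above, that a test's discriminatory value within an ITS depends on the pre-test probability of hazard, can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than a method proposed in the article. The function names, the 5% starting prevalence, and the 90% sensitivity and specificity are hypothetical, and the two tests are assumed to be conditionally independent given the true hazard status, which is a strong simplification for real testing strategies. The Wilson score interval is included as one common way of attaching a confidence interval to a sensitivity estimated from a small validation set; the article does not prescribe a particular interval method.

```python
import math


def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Update the probability of hazard after one test result (Bayes' theorem)."""
    if positive:
        likelihood_ratio = sensitivity / (1.0 - specificity)
    else:
        likelihood_ratio = (1.0 - sensitivity) / specificity
    post_odds = pre_test / (1.0 - pre_test) * likelihood_ratio
    return post_odds / (1.0 + post_odds)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (about 95% coverage for z = 1.96)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical two-step strategy; all numbers are illustrative assumptions.
p0 = 0.05                                          # pre-test probability (prevalence)
p1 = post_test_probability(p0, 0.90, 0.90, True)   # after a positive first test
p2 = post_test_probability(p1, 0.90, 0.90, True)   # after a positive second test
print(f"hazard probability: {p0:.2f} -> {p1:.2f} -> {p2:.2f}")

# Sensitivity estimated from 20 detected toxicants out of 25 tested.
low, high = wilson_interval(20, 25)
print(f"sensitivity 0.80, approximate 95% interval: {low:.2f} to {high:.2f}")
```

With these assumed values, the probability of hazard moves from 0.05 to about 0.32 after the first positive result and to 0.81 after the second, while a sensitivity of 0.80 estimated from 25 toxic substances carries an interval of roughly 0.61 to 0.91. The width of that interval shows why validation sets of this size leave considerable uncertainty in any probability derived from them.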