A Roadmap for the Development of Alternative (Non-Animal) Methods for Systemic Toxicity Testing

1.2 A framework for replacing systemic toxicity testing by new approaches
1.2.1 Abolition of useless tests
1.2.2 Reduction to key events
1.2.3 Negative exclusion by lack of key property
1.2.4 Optimization of existing tests
1.2.5 In silico approaches
1.2.6 Information-rich single tests
1.2.7 Integrated testing strategies (ITS)
1.2.8 Pathways of Toxicity (PoT) and systems toxicology


A roadmap for the development of alternative (non-animal) methods for skin sensitization testing
3.1 Introduction: skin sensitization
3.2 What is it that we are really trying to achieve - what will success look like?
3.3 Is the international scientific community marshaled in the right way to make real progress in this area?
3.4 What should be the future research imperatives?
3.5 Skin sensitization testing in vitro - can we do it already?
3.6 Is hazard identification alone good enough?
3.7 What needs to change?
3.8 Conclusions and recommendations: skin sensitization

Testing needs for the REACH legislation
As an enormous investment into consumer product safety, the REACH program aims to assess existing ("old") chemicals that have previously undergone very little testing (Hartung, 2010a). Regulation (EC) 1907/2006, known as REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals), revises the Dangerous Substances Directive (67/548/EEC). The registration process has only recently begun, and the estimated testing demands are under debate (Hartung and Rovida, 2009a,b; Rovida and Hartung, 2009; Rovida et al., 2011). However, there is little doubt that systemic toxicity will account for more than 95% of the testing costs and animal use of REACH. It is clear that testing capacities are challenged and that alternative approaches, especially for systemic toxicities (as called for in the legislation), might relieve such tensions.

A framework for replacing systemic toxicity testing by new approaches
The advantage and the disadvantage of alternative methods both lie in the reductionist character of their approach. This eases interpretation to the extent that a simpler read-out is likely to result from such an approach, but raises the issue of what aspects of the biology might be missing. Aside from abolishing useless tests (1.2.1) (which is not an alternative method but should nonetheless be considered as an option), a number of principal alternative approaches (1.2.2 to 1.2.4 and 1.2.6) were identified. These include in vitro and in silico (1.2.5), as well as combined approaches (1.2.7-1.2.8), by either mining or modeling the respective data and/or relating them back to structure and other properties of the test substance.

Abolition of useless tests
A cost-benefit analysis could help in making decisions to abandon tests of questionable practical utility. Such considerations may be based on reproducibility issues, lack of predictivity, lack of scientific basis, or limited contribution to regulatory decision-making. Obviously, "uselessness" is a value judgment. For animal tests, a number of limitations (Hartung, 2008b) can be evaluated in terms of whether they translate to the given test. The socioeconomic impact of wrong or missing assessments needs to be taken into consideration (Hartung, 2009, 2010), along with other sources of information to substitute for comparison of performance characteristics with other methods. Tests that have been abolished in the past include the traditional LD50 test (OECD TG 401; OECD, 1987), the abnormal toxicity test for vaccines, and the ascites mouse for the production of monoclonal antibodies.

Background
Two pieces of European legislation have created the pressure to develop novel approaches for systemic toxicity testing, beyond the general urge to replace animal testing as prescribed in the European Directive 2010/63/EU on the protection of animals used for scientific purposes (Hartung, 2010d; Seidle et al., 2011). This report deals with methods for the testing of all chemicals, and does not focus only on cosmetics. This activity is aimed at providing a scientific roadmap for the replacement of animal-based safety testing in all domains. 2

The 7th Amendment of the Cosmetics Directive
On January 15, 2003, the EU passed a law banning the testing of cosmetics and their ingredients on animals, reinforced by marketing bans with different deadlines. Known as the 7th Amendment (Directive 2003/15/EC) to the Cosmetics Directive (Directive 76/768/EEC), this Directive is intended to protect and improve the welfare of animals used for experimental purposes by promoting the development and use of scientifically valid methods of alternative testing (Hartung, 2008a). The main objective of this Directive is to prohibit the testing of cosmetic products/ingredients on animals through a phased series of EU testing and marketing bans. This ban on animal testing and sales would start immediately where alternative non-animal tests are available, followed by a complete testing ban six years after the Directive became effective (i.e., in 2009). Therefore, animal experiments for cosmetic products and ingredients are completely banned, reinforced with a marketing ban in the EU since 2009, irrespective of the availability of animal-free methods, except for repeat-dose toxicological endpoints (i.e., toxicokinetics, repeated dose toxicity, skin sensitization, carcinogenicity, and reproductive toxicity) where the EU marketing ban is delayed until 2013 for tests carried out outside the EU. This ban may, however, be postponed by a new legislative act if alternative tests cannot be found.

1 The introduction text was largely part of the original whitepapers on carcinogenicity and reproductive toxicity and discussed in this context.
2 At the workshop I. Ruhdel pointed out that from an animal protection point of view the workshop should not be seen or communicated as an activity in the context of the current discussion on possibly postponing the marketing ban on animal tested cosmetics.
- standardization and automation
- quality assurance of procedures
- appropriate statistics and prediction models
- definition of applicability domains
- extensions to address solubility issues and nanomaterials

These opportunities will differ from test to test. They can improve the predictive value of tests, making them (more) fit for purpose. Such changes, however, will typically require a (re-)assessment of the validity of the modified system.

In silico approaches
A number of approaches (Hartung and Hoffmann, 2009) try to link, often via structure and physicochemical descriptors, to results available for other substances to avoid testing. They are somewhat similar to what is referred to as "read-across" but in a formalized and quantitative way, using either rules, empirical correlations to parameters of interest, or other modeling exercises. For complex endpoints, such models are unlikely to be used as stand-alone replacements, but are better suited to provide valuable supporting information as part of a weight of evidence approach. They can play a key role in combination with other tools or to further optimize biological measurements. It is foreseeable that some Integrated Testing Strategies (ITS, see 1.2.7) developed in the near future actually will be in silico tools with biological inputs. The basic problem is that we base our judgments on existing knowledge and its availability and quality. Surprising effects can hardly be predicted, and all quality limitations of this existing knowledge (e.g., quality of animal test data or mechanistic understanding) will translate to the estimation technique. This is not unique to modeling approaches per se, but it is important to note that the value of existing information (see 1.2.1) is again the critical starting point. While there are established measures of similarity of chemicals, these merely address structural similarity and do not consider the context of the endpoint of concern. Thus, even if we assume that we may have a fair appreciation of structural similarity, understanding whether this is key for the distribution of the chemical in the organism and its toxic mechanism is an additional consideration.
A limitation of all these techniques is that they can only be readily applied to discrete organic substances. That suggests, based on rough estimates, that some 50% of the chemicals impacted under REACH, which include mixtures, substances lacking purity, salts, metal compounds, etc., cannot be readily evaluated using modeling approaches (Hartung and Hoffmann, 2009). Furthermore, health effects for which small impurities are relevant cannot be handled with such structure-based estimation techniques: allergic reactions (sensitization), for example, can be caused by less than 0.1% of contamination. By the same reasoning as for possible contaminants, health effects for which no thresholds can be established (carcinogenic, mutagenic, or some reproductive toxicants) should not be evaluated on the basis of the structure of the main compound only (whereas these contaminants are typically present in in vivo or in vitro tests). It is noteworthy that these are exactly the tests that consume the most animals and resources (>80%) under REACH.
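The notion of structural similarity discussed above can be made concrete with a small sketch. The following Python example computes the Tanimoto (Jaccard) coefficient between sets of structural features and picks the most similar analogue as a read-across candidate. The feature names and the analogues are hypothetical placeholders, not data from this report, and the sketch deliberately shows only structural similarity: as noted in the text, it carries no information about the endpoint of concern.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto (Jaccard) coefficient: shared features / all features."""
    union = features_a | features_b
    if not union:
        # Two empty feature sets carry no information; treat as dissimilar.
        return 0.0
    return len(features_a & features_b) / len(union)


def most_similar(target: set, analogues: dict) -> tuple:
    """Return (name, similarity) of the structurally closest analogue."""
    name = max(analogues, key=lambda key: tanimoto(target, analogues[key]))
    return name, tanimoto(target, analogues[name])


# Hypothetical structural features, for illustration only.
target = {"aromatic_ring", "hydroxyl", "chlorine"}
analogues = {
    "analogue_1": {"aromatic_ring", "hydroxyl"},
    "analogue_2": {"aromatic_ring", "nitro", "amine"},
}

best_name, best_score = most_similar(target, analogues)
print(best_name, round(best_score, 2))
```

A high score here only says that two substances share many structural features; whether that similarity matters for distribution in the organism or for the toxic mechanism has to be judged separately.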

Reduction to key events
Traditional 3Rs or alternative methods have been aimed at a one-to-one replacement of animal tests. This appears to be feasible if a key (rate-determining) event can be readily identified. Examples of such attempts include key events such as mutagenicity, or possibly cell transformation, for carcinogenicity, whereas protein binding is assumed to be a prerequisite for skin sensitization. The selection of key events can be informed by the scientific understanding of the pathophysiology or through analysis of what was derived, i.e., what was actually observed in guideline tests that drove the classification (for example, which organ toxicities are actually driving regulatory decisions) or is seen in intoxicated patients (human-relevant manifestations). The obvious central question is: Can a key event for the given hazard or test concern be readily identified? The scientific challenge lies in the state of mechanistic understanding: for some toxicological endpoints a single non-animal test can be used to sufficiently characterize the adverse effects of the chemical. For other, more complex endpoints, several non-animal approaches are required to fully characterize the impact of the chemical on the relevant tissue(s).

Negative exclusion by lack of key property
The most prominent examples of exclusion criteria (conditio sine qua non) are large molecular size or barrier models: no bioavailability/penetration, no harm. The obvious problem is the reliance on negative data (no transfer). This concept is further refined by the threshold of toxicological concern (TTC) approach (Kroes et al., 2005), where exposure (and thus resulting availability in sufficient quantity), rather than absolute bioavailability, is evaluated: For non-cancer endpoints, NOAELs or, alternatively, TD50 (toxic dose 50%) values are collected for a large number of chemicals, and their distribution is used in combination with a safety factor to set a threshold where no adverse effect is expected. TTC values have been derived for different structural classes, e.g., Cramer classes, while other TTC have been derived and subsequently refined on the basis of specific structural alerts for genotoxicity.
Similarly, many toxic endpoints rely on reactive chemistry allowing interaction with target structures. The absence of structural features allowing direct reactivity or activation via metabolism represents another example of exclusion of a hazard.
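The TTC idea described above (a distribution of NOAELs combined with a safety factor) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NOAEL values are invented, and the choice of the 5th percentile and a safety factor of 100 are illustrative parameters of this sketch rather than values taken from this report. Published derivations may also work on fitted distributions rather than on raw empirical percentiles.

```python
import statistics


def threshold_from_noaels(noaels, percentile=5, safety_factor=100.0):
    """Derive an exposure threshold from a set of NOAELs (mg/kg bw/day).

    A low percentile of the NOAEL distribution is divided by a safety
    factor. Both parameters are assumptions made for this illustration.
    """
    # 99 cut points; index k-1 holds the k-th percentile.
    cut_points = statistics.quantiles(noaels, n=100, method="inclusive")
    low_noael = cut_points[percentile - 1]
    return low_noael / safety_factor


# Hypothetical NOAELs for one structural class, in mg/kg bw/day.
hypothetical_noaels = [0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]

threshold = threshold_from_noaels(hypothetical_noaels)
print(f"threshold: {threshold:.4f} mg/kg bw/day")
```

Exposures below such a threshold would, under these assumptions, be considered of low concern without substance-specific testing; the reliability of the result depends entirely on how representative the underlying NOAEL collection is.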

Optimization of existing tests
In vitro tests have no fewer limitations than their in vivo counterparts (Hartung, 2007a). A number of strategies may be able to improve the predictive value of existing test systems:
- extension of metabolic capacity
- organotypic 3-dimensional (co)-cultures
- more physiologic culture conditions such as homeostasis, oxygen supply, cell density
- transition from cell lines to primary cells or stem cell-derived systems
- use of human cells
- refinement and expansion of endpoints measured

alone alternatives and are now combined because they did not achieve this. The downside to this might be that they are not sufficiently complementary to make a major change in an ITS. The systematic construction of components for an ITS represents a key opportunity to advance the overall ITS approach. A very promising way of constructing a testing strategy is breaking the (patho-)physiology down to crucial elements, e.g., the different elements of the reproductive cycle (as was done for the ReProTect project) (Hareng et al., 2005) or the key processes of neurodevelopment in the series of DNT workshops. However, this still leaves open the question of how to integrate all these tests. The concept of ITS was advanced substantially during the development of the REACH technical guidance (Schaafsma et al., 2009). Regulatory toxicology to date has been developed as a toolbox of tests, which allows the health effects of new substances (especially pre-market drugs and pesticides) to be classified before carrying out a risk assessment. Given that little was known about the inherent properties of a given chemical and minimal information about possible future uses was available, each test within the toolbox was optimized to have as few as possible false-negative results, which might represent a later safety risk. Indeed, in the absence of information it is preferable to "over-label" a possible hazard, often called the "precautionary principle."
As a consequence, an unknown proportion of substances are abandoned based on false-positive test results in their development as drugs or consumer products, but this is usually accepted since similar substances with favorable profiles are available as alternatives in the test battery. Note that this situation is completely different for REACH purposes, where the same test methods need to be applied to test valuable commodity substances with a significant history of safe use.
Many tests in the toxicological toolbox are dichotomous, i.e., they can have only two outcomes (positive or negative). This suggests that when optimizing the test for few false-negatives, the number of false-positives is increased. However, even the simplest biological aspects are not dichotomous: Sex is male or female, but what about transvestites, transsexuals, hermaphrodites, castrates, Turner (only one X chromosome) or Klinefelter (XXY) syndrome? There is a grey area. When we set our thresholds, we determine the extent of grey and whether we favor false-positives or false-negatives. Due to the precautionary approach in toxicology, thresholds are set to minimize false-negatives, thus favoring false-positives. Although some non-animal test methods have prediction models with only binary outcomes (often to reflect the reference test result), this is rarely the way they are applied, and most non-animal test methods are now being designed to predict dose-response information.
The "one suits all" philosophy of the animal test toolbox leads to the problem that usually only one test is available to give the final result. This means that the proportion of false-positives cannot be corrected. Even worse, if several tests allowing false-positives are combined, e.g., the mutagenicity test battery or testing in several species for repeated dose toxicity,

The role of in silico techniques will principally be within ITS, not as stand-alone replacements. They will support other types of information, help to prioritize and, following evaluation, increasingly substitute for testing. It might be that they can serve as 2nd-generation alternative methods, i.e., modeling validated in vitro methods, because these simpler but standardized tests allow for the generation of large datasets, which would facilitate modeling of new key events.

Information-rich single tests
The sensitivity of the test system, i.e., here the spectrum of interactions with xenobiotics covered by the test system, can be increased by measuring more endpoints, e.g., by omics or high-content imaging. This can be done by supervised analysis (measuring known biomarkers or hazard pathways) or in an unsupervised manner by testing for any response, which only then is interpreted as a signature of effect. Prominent examples are cell systems combined with transcriptomics, proteomics, or metabolomics. This typically will lead to signatures of toxicity (SoT), i.e., a reduction of information to patterns of signals associated with the hazard. Notably, identifying biomarkers from the variety of signals should shift the approach away from the more traditional (1.2.2) and (1.2.3) approaches. High-content measurements, such as image analysis, represent other technologies increasingly applied here. We should bear in mind that even the most sophisticated measurements and bioinformatics can hardly overcome the limitations of the cell systems. Therefore, the experience gained with the development and validation of alternative methods with simple endpoints is of critical importance when moving towards wholly novel technologies. Good Cell Culture Practices form only one example here (Coecke et al., 2005; Hartung, 2010b; Leist et al., 2010; Wilcox and Goldberg, 2011).

Integrated testing strategies (ITS)
In every case where no single property or single test system can be identified to cover a hazard, tests will need to be combined and results integrated. One key example is the combination of toxicity data (e.g., derived in vitro and/or in silico) with kinetic data (e.g., modeling) in ITS (see, e.g., DeJongh et al., 1999; Forsby and Blaauboer, 2007; Blaauboer, 2010). The purposes of combining tests can be:
- covering different mechanisms or applicability domains
- increasing the predictive value compared to a single test
- avoiding costly tests or animal tests by filtering out certain substances
- adding kinetic information to hazard evaluations
- integrating existing data

In the simplest case an ITS is a battery of tests, and any positive result is taken as an indication of toxicity, as is the case for the combined mutagenicity tests. More sophisticated combinations with interim decision points are emerging (Jaworska et al., 2011; Jaworska and Hoffmann, 2010), but accepted concepts regarding how to construct and validate them are not available. A major problem seems to be that most methods now being combined into ITS were originally developed to work as stand-

often a switch to a substance with a better toxicological profile but with the possibility of a similar effect is possible. Normally there is no time to rule out false-positives. False-negatives, however, represent possible disaster (not only the worst case, when successful drugs have to be withdrawn from the market, but also when expensive clinical evaluations have to be stopped because of side-effects or the need for additional toxicological studies).
For chemicals and consumer products, the situation is, in principle, very similar. For work safety, over-labeling is not very critical, and for consumer products there is often a choice among less critical chemicals. It is telling that more than 90% of the new chemicals notified are not acutely toxic (more than 50% of the animals survive a dose of 2 g/kg); this means that such non-toxic substances mostly have been further developed to applications that reach the market and notification.
Several business impact studies have been carried out for REACH. The fundamental problem of applying tests optimized for new chemicals to existing chemicals, however, has so far escaped attention: How much effort will be spent to demonstrate that a result is indeed a false-positive? Typical measures include:
- repetition of the test
- testing in a second species
- mechanistic studies
- identification of critical metabolites and possible species differences to humans
- exposure scenarios

All these measures are as costly as, or sometimes even enormously more costly than, the original test. Worse, they always leave some doubt with regard to the substance. Thus, it is critically important that the number of false-positives be limited up front. In the field of carcinogenicity, in particular, the precautionary principle produces many false-positive results. It is well known that the in vivo test for carcinogenicity has produced enormous numbers of false-positive results already (see below). In addition to the in vivo test for carcinogenicity, the current in vitro test battery for mutagenicity, i.e., the combination of two tests, results in a false-positive rate of 65-90% for non-carcinogenic substances. This means that the already high proportion of false-positive results from the cancer bioassay will be further increased by an enormous number of non-carcinogenic substances showing a genotoxic effect in one of the two in vitro tests. Furthermore, aspects of variability related to a test, e.g., inter-animal variation, or within- or between-laboratory variability, can cause false-positive results.
ITS do more than define how to test strategically; they also determine whether to test at all, as existing and non-testing information can also be integrated. There are three reasons why testing of a substance might not be necessary:
- Available information on a given substance is sufficient.
- Information on related compounds is sufficient to extrapolate.
- Exposure or uptake by the organism is so low that testing can be waived.

These three aspects have to be separated from creating new

reproductive toxicity, or carcinogenicity, a further increase in the proportion of false-positives will arise. This will be the case particularly when non-specific tests are used for relatively rare hazards (Hoffmann and Hartung, 2005). In such a case, the false-positives likely outnumber the real-positives, e.g., by tenfold in case of the cancer bioassay (see below).
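The statement that false-positives can outnumber real positives for rare hazards follows directly from the arithmetic of prevalence, sensitivity, and specificity. The short Python sketch below works this through; the prevalence and performance figures are hypothetical values chosen for illustration and are not estimates from this report.

```python
def expected_positives(prevalence, sensitivity, specificity):
    """Expected fractions of true and false positives among all tested substances."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives, false_positives


def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive test result reflects a real hazard."""
    tp, fp = expected_positives(prevalence, sensitivity, specificity)
    if tp + fp == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return tp / (tp + fp)


# Hypothetical example: a rare hazard (5% of substances) and a sensitive
# but non-specific test (90% sensitivity, 50% specificity).
tp, fp = expected_positives(0.05, 0.90, 0.50)
ppv = positive_predictive_value(0.05, 0.90, 0.50)
print(f"false-positives per real positive: {fp / tp:.1f}")
print(f"positive predictive value: {ppv:.1%}")
```

With these assumed numbers, roughly ten false-positives arise for every real positive, even though the test detects 90% of the true hazards.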
The extent of false-positives also is determined by the number of replicate animals. In its most typical application (discriminating between non-responding and responding animals), the use of replicates again reduces false-negatives and increases false-positives. Similarly, multiple testing increases the number of false-positives. Setting a significance level of 5% (i.e., 95% confidence) implies that, on average, one out of 20 results is false-positive. To date, the cancer bioassay includes more than 60 endpoints, the reproductive two-generation study 80, and a 28-day repeated dose study 40; arguably, it is difficult for any substance to test negative. The same reasoning holds true for other tests. The more tests done on a single chemical, the more likely that there is a positive result in one. A cynic might conclude that a non-toxic substance must be one that has not been tested often enough.
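The multiple-testing argument above can be quantified with a minimal sketch. Assuming, purely for illustration, that all endpoints are statistically independent and that the substance is truly inactive, the chance of at least one spurious positive among n endpoints tested at the 5% level is 1 - 0.95^n. Real endpoints are correlated, so the figures below are an upper-end illustration rather than a prediction.

```python
def prob_any_false_positive(n_endpoints: int, alpha: float = 0.05) -> float:
    """Chance of at least one false-positive among independent endpoints."""
    return 1.0 - (1.0 - alpha) ** n_endpoints


# Endpoint counts as quoted in the text.
studies = {
    "28-day repeated dose study": 40,
    "cancer bioassay": 60,
    "two-generation study": 80,
}

for name, n_endpoints in studies.items():
    chance = prob_any_false_positive(n_endpoints)
    print(f"{name} ({n_endpoints} endpoints): {chance:.0%}")
```

Under the independence assumption, an inactive substance would show at least one "significant" finding in well over 80% of such studies, which is the quantitative core of the cynic's conclusion.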
REACH foresees the application of the toxicological toolbox to existing chemicals of often enormous economic value. The costs of REACH have been calculated until now on the basis of the actual costs of the tests that would be required to prepare the dossiers. The consequence of false-positive classifications is largely overlooked, at least by regulatory agencies, though the potential impact is not lost on companies. The consequences include unnecessary restrictions of use and safety measures, unjustified abandoning of chemicals, or laborious follow-up studies to rule out a particular unwarranted safety concern. The only rational exit from this dilemma is through a combination of tests, i.e., a test strategy in which at least one sensitive (few false-negatives) test and one specific (few false-positives) test are combined. Integrated Testing Strategies (ITS) are needed.
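To see why combining a sensitive and a specific test helps, the following Python sketch compares two ways of combining the same pair of tests: a battery in which any positive counts, and a tiered strategy in which only positives from the sensitive screen are passed to the specific test for confirmation. The performance figures are hypothetical, and the formulas assume that the two tests err independently of each other, which real assays often do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayPerformance:
    sensitivity: float  # fraction of true hazards detected
    specificity: float  # fraction of non-hazards correctly cleared


def any_positive(a: AssayPerformance, b: AssayPerformance) -> AssayPerformance:
    """Battery rule: the substance is positive if either test is positive."""
    return AssayPerformance(
        sensitivity=1.0 - (1.0 - a.sensitivity) * (1.0 - b.sensitivity),
        specificity=a.specificity * b.specificity,
    )


def confirm_positives(screen: AssayPerformance, confirm: AssayPerformance) -> AssayPerformance:
    """Tiered rule: positive only if the screen AND the confirmatory test agree."""
    return AssayPerformance(
        sensitivity=screen.sensitivity * confirm.sensitivity,
        specificity=1.0 - (1.0 - screen.specificity) * (1.0 - confirm.specificity),
    )


# Hypothetical tests: a sensitive screen and a specific confirmatory test.
screen = AssayPerformance(sensitivity=0.95, specificity=0.50)
confirm = AssayPerformance(sensitivity=0.70, specificity=0.95)

print("battery:", any_positive(screen, confirm))
print("tiered: ", confirm_positives(screen, confirm))
```

Under these assumptions the battery rule pushes specificity below that of either test, while the tiered rule raises it at the cost of some sensitivity. Which trade-off is acceptable depends on the purpose of the strategy.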
There is a fundamental difference between the testing needs of new versus existing chemicals: Any new chemical represents a possible health hazard, while the longer a chemical is in use, the lower the uncertainty. After the creation of a new chemical, its utility is uncertain, while the longer it is in use, the more its economic value becomes evident. The consequence is simple: false-positive toxicological results are less and less tolerable. While we tend to accept the result of a toxicological evaluation early after generation of a chemical and uncertainty is not welcome, for advanced chemicals in broad use it is unavoidable that problematic test results be questioned.
Drug development represents a good example of the attitude toward new substances, especially since this field has pioneered and shaped our toxicological approach. Classically, i.e., when the toxicological toolbox was developed, around 10,000 substances were synthesized and evaluated to bring one product to the market. Since the bulk of the cost is generated in the clinical phase of development, and toxicological studies represent an "entry permit" for first-time testing in humans, an early and clear statement on the health hazards of a substance is most important. Typically, a broad variety of similar substances related to the lead compound under development are synthesized and

Exposure/bioavailability-based waiving represents another key decision point in many ITS. For most health effects (most likely even for cancer and reproductive toxicology), a minimum concentration must be reached in the target tissue. If this can be excluded due to exposure scenarios and/or limited uptake by the organism, it might not be necessary to conduct further testing. However, this means that the judgment is not definite but depends on chemical use (exposure scenarios and route of application). This approach is most promising for cosmetics, where clear exposure scenarios are given. It also can apply to strictly controlled intermediates when containment can be assured by appropriate risk management measures, and hence TTC-type approaches can be useful to set "health benchmarks" for exposures because likely exposure scenarios can be formulated. It is worth noting that the best-established alternative approach to assess uptake is the one for skin absorption (OECD Test Guideline 428; OECD, 2004), again favoring applications for cosmetic ingredients. At the same time, we need ways to incorporate dermal absorption into risk assessments under REACH, rather than being forced to live with conservative 100% defaults.
When composing and validating a test strategy, it is crucial to assess the performance characteristics of all building blocks. Emerging methodologies (e.g., from Bayesian decision theory) may provide valuable tools for strategic development (Jaworska and Hoffmann, 2010). Some principles for ITS are evident:
- Combine sensitive and specific tests; combine screening and confirmatory tests. Pertinent examples are mutagenicity tests, where the positive results of a battery of usually two in vitro tests (accepting a huge proportion of false-positives, i.e., 95%) are subsequently ruled out by the animal experiment.
- For rare health effects, identify the negatives; use prioritization to increase frequency of positive results.
- Assigning a test result means reducing information; combination of raw data from two tests might be more powerful than combining two final test results.
- For the mutagenicity test battery it has been shown that tests of low predictivity on their own can be combined to result in highly predictive tests (Jaworska et al., 2005).
- Allow interim decisions to obviate further testing (tiered testing strategies).
- Conduct inexpensive and/or non-animal tests first.
- Interlink tests for various health effects, e.g., using the same control groups or addressing several endpoints in one animal study (beware of multiple testing).
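As a rough illustration of how Bayesian reasoning can support tiered testing with interim decisions, the sketch below updates the probability that a substance is hazardous after each test result and stops as soon as the probability crosses a lower or upper decision threshold. This is not the method of the cited authors: the prior, the test performance values, and the thresholds are hypothetical, and the tests are assumed to be conditionally independent.

```python
def update_probability(prior, sensitivity, specificity, positive):
    """Bayes' rule for one dichotomous test result."""
    if positive:
        rate_if_hazard, rate_if_safe = sensitivity, 1.0 - specificity
    else:
        rate_if_hazard, rate_if_safe = 1.0 - sensitivity, specificity
    numerator = prior * rate_if_hazard
    denominator = numerator + (1.0 - prior) * rate_if_safe
    if denominator == 0.0:
        raise ValueError("result is impossible under the stated test performance")
    return numerator / denominator


def run_tiered_strategy(prior, results, lower=0.05, upper=0.75):
    """Apply tests in order; stop once the probability allows a decision.

    `results` is a list of (sensitivity, specificity, positive) tuples,
    ordered from the cheapest test to the most expensive one.
    Returns the final probability and the number of tests actually used.
    """
    probability = prior
    tests_used = 0
    for sensitivity, specificity, positive in results:
        probability = update_probability(probability, sensitivity, specificity, positive)
        tests_used += 1
        if probability <= lower or probability >= upper:
            break  # interim decision: further testing is not needed
    return probability, tests_used


# Hypothetical case: 10% prior, three tests that would all come back positive.
results = [(0.90, 0.60, True), (0.70, 0.95, True), (0.80, 0.80, True)]
probability, tests_used = run_tiered_strategy(0.10, results)
print(f"probability of hazard: {probability:.2f} after {tests_used} tests")
```

In this example the third test is never run, because the second result already moves the probability past the upper threshold; a negative result in the first, sensitive test would likewise end the sequence early.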

Pathways of Toxicity (PoT) and systems toxicology
Our scientific understanding of how genes, proteins, and small molecules interact to form molecular pathways that maintain cell function is evolving rapidly. Pathways that lead to adverse health effects when perturbed are referred to as Pathways of Toxicity (PoT). The exploding scientific knowledge of mode of action in target cells, tissues, and organs, driven by advances in molecular and computational tools and coupled with the con-

knowledge. Again, the strategic combination of individual tests is often needed. Combinations of tests are required when the performance of one test cannot suit all needs. The following aspects have to be taken into account to optimize the approach for a given purpose:
- work load and costs
- animal consumption
- certainty of result and resulting safety level
- applicability, e.g., for chemical classes

Components of ITS other than testing in vitro or in vivo are:
- Use of existing information: Possible sources of information will differ for given substances and fields. The most important questions are how to retrieve them and how to judge their quality (and, thus, their utility). Quality of science does not depend on quality measures like ISO or Good Laboratory Practice, but such quality-assurance programs safeguard proper documentation and the reliability of results. Similarly, adherence to international test guidelines is not a prerequisite for good toxicology, but it facilitates comparability and acceptance. It will be necessary to agree on criteria for each given purpose, which might benefit from the development of scoring systems for the quality of studies and possibly thresholds for acceptability.
- Extrapolation from existing information: Several ways of using information on other chemicals have to be distinguished:
  - read-across (interpolation from existing data of related chemicals), i.e., the data gap filling conducted within a category of substances
  - chemical grouping (testing of prototypic compounds out of a group of similar ones only)
  - structural alerts and rule-bases (structural characteristics that raise concerns or rule out possible hazards; SAR, structure-activity relationships)
  - (quantitative) structure-activity relationships, i.e., (Q)SAR (correlation of chemical characteristics, such as physicochemical descriptors, with activities)

The basic question is intriguing: can we use information on similar chemicals to draw conclusions for those for which we have no test results? Certainly not always. Who could possibly predict that a shift of an OH-group in a dioxin molecule changes the potency a thousandfold? The question is whether the uncertainty of such estimation techniques is larger than the uncertainty of tests and interspecies predictivity. Few formal validations have been initiated for some methods ((Q)SAR and rule-based systems). There are parallel efforts underway elsewhere to define which scientific principles and approaches are merited to confirm and justify the appropriateness of a read-across. In general, formal validations are avoided; what is needed instead are concrete examples that help benchmark potential acceptance under regulatory frameworks by establishing consistent, context-dependent approaches for each chemical and endpoint under consideration. Some similar assessments of read-across approaches and chemical grouping will be necessary. However, concepts for validation, especially of ITS, are only emerging (Kinsner-Ovaskainen et al., 2009).

ing strategy in 2009.
Depending on the proponent, more or less emphasis is given to technological updates, throughput of testing, costs, replacement of animal testing, or quality of toxicological assessments. There is no doubt that all aspects synergize to bring about a potentially revolutionary change (Hartung, 2008c).
Although a broad discussion has ensued on the design and feasibility of the new toxicity testing paradigm, we are only at the beginning of such a shift. Recognizing that success will require a long-term, concerted effort by many investigators working in a coordinated manner, two NIH institutes (NHGRI, NIEHS), along with EPA and FDA, entered into a formal collaboration in 2009, now known as Tox21. These partners have demonstrated high-throughput screening assays to identify toxicity pathways and are developing computational models and analysis, and informatics tools, all of which can be leveraged for this project.
Although there is not yet a consensus definition for PoT (concepts range from perturbed physiological pathways to adverse outcome pathways, modes of action, or signaling cascades), the general idea is to develop a field of systems toxicology using systems biology as a "role model." Parallel developments in all fields of the life sciences will support this, but toxicology has some features that will help drive its development: -an urgent need for change -immediate commercial applications -reference substances to induce toxicities -the foundation of (pre-)validated alternative methods from $ 500+ million of research funding -a culture of Good laboratory Practice (GlP), Good Cell Culture Practice (GCCP), and validation (and increasingly eBt) for quality control comitant development of high-throughput and high-content screening assays, enables interrogation of these Pot and provides a means to study and evaluate the effects of thousands of chemicals. A number of PoT have been identified already; however, most Pot are only partially known, and no common annotation exists. Mapping the entirety of these pathways -a project we have termed the Human toxome -will be a large-scale effort, perhaps on the order of the Human Genome Project. the 2007 NRC vision document, Toxicity Testing for the 21 st Century -a Vision and a Strategy (Krewski et al., 2010), has strongly endorsed the concept of Pot. this vision embraces new high-content, high-throughput, and bioinformatics tools for identifying Pot. europe and the US have pursued the development of new toxicological tools in very different ways (Hartung, 2010b). the NAS/NRC tox-21c report calls for a paradigm shift in toxicology. 
In February 2008, several American agencies, recently joined by the FDA, announced a coalition to facilitate its implementation (Collins et al., 2008): "We propose a shift from primarily in vivo animal studies to in vitro assays, in vivo assays with lower organisms, and computational modeling for toxicity assessments." In USA Today of the same day, Francis Collins, now Director of the National Institutes of Health, stated: "(toxicity testing) was expensive, time-consuming, used animals in large numbers, and didn't always work." In the same article, elias Zerhouni, then Director of NIH, said: "Animal testing won't disappear overnight, but the agencies' work signals the beginning of the end." Only four years after publication of the NAS/NRC report, we have seen numerous conferences and symposia addressing the report and its implementation, the formation of an alliance of US agencies, and the development of a new ePA toxicity test- The identification and use of PoT is the basis for undertaking a revolutionary approach to toxicity testing. Although modern toxicology has identified many modes of action, they have largely remained isolated mechanisms that cannot be broadly applied to sufficient numbers of toxicants to warrant the establishment of dedicated toxicity tests, and they do not yet satisfy regulatory needs. This means that our proposed PoT definition and development of novel test strategies not only initiates a novel test paradigm in general, but will also benefit specific screening programs. It aims to change the general toxicity testing paradigm. the key challenges to this are: -a harmonized definition, annotation, visualization, and sharing of Pot. -strategies from systems biology for PoT identification and their validation.

-composition of integrated testing strategies based on these PoT with a definition of adversity and subsequent translation to a risk assessment paradigm.
Mapping the Human Toxome will be a first step towards the development of a Human Toxicology Project. In contrast to the currently used phenomenological "black box" that is animal testing, pathways of toxicity (PoT) will be identified primarily in human in vitro systems to provide more relevant, accurate, and mechanistic information for the assessment of human toxicological risk. The ultimate future goal is to bring together a broad scientific community to map the entirety of the Human Toxome.
The concentration at which a substance triggers a PoT will be extrapolated to a relevant human blood or tissue concentration and, finally, a corresponding dose by (retro-)PBPK (physiology-based pharmacokinetic) modeling, informing human risk assessment (Adler et al., 2011). Perhaps more importantly, if a substance does not trigger any of these PoT, it may for the first time be possible to establish the lack of toxicity (i.e., safety) of a substance at a given concentration. This project will need to combine several of the latest emerging technologies in life sciences. Transcriptomics and metabolomics currently are the most advanced technologies for pathway identification, but these are rarely combined to map pathways. The main difference from ITS is that this approach will operate at the subcellular level and break modes of action and mechanisms down to the underlying pathways or the perturbation of physiological pathways (notably, two very different definitions). The term pathway might be misleading, as we are more likely referring to perturbations of networks. The approach only becomes meaningful if a common annotation of PoT is developed. A central repository of PoT constituting the (Human) Toxome can then be created (Hartung and McBride, 2011). This might serve in the future to identify PoT that are associated with, crucial for, or amplifying a given hazardous effect, or pathways of defense (PoD) that protect against, reverse, or dampen it. The link to classes of substances, cell populations, species, or resulting phenotypic changes will foster the understanding of the specific effect. Toxicology is increasingly embracing the technologies of the 21st century (Bhogal et al., 2005). The discussion surrounding Tox-21c has accelerated this process, as many have started to develop and commercialize these technologies, which lend themselves to the vision's implementation (van Vliet, 2011). This parallels developments in all life sciences implementing and exploiting the new technologies.
Unlike most medical questions, toxicology has the advantage of having a relatively clear start and end to the pathways, i.e., defined substances and hazards, as compared to usually multi-factorial contributors to disease and complex manifestations impacted by individual constellations of the patient.
The basic idea of Tox-21c is a change in the level of resolution. In a nutshell, biochemistry and molecular biology are used to describe phenomena, rather than the physiology and cellular pathology that, so far, have been used predominantly when discussing modes of action. Figure 1.1 illustrates the larger perspective on the evolution of approaches: technologies have developed over the last century from animal to in vitro/in silico and, more recently, mode of action resolution. The concept of Tox-21c is to further refine resolution of analysis to the molecular basis of PoT. These technologies correspond to different quality assurance measures, however, where the validation of ITS (typically built from combining mode of action tests) and evidence-based toxicology (EBT) (Hartung, 2009b; Hoffmann and Hartung, 2006; Griesinger et al., 2007) are only emerging. The figure captures how current regulatory toxicology is formed by the earlier technologies, leading to a deterministic (point estimate), typically precautionary risk assessment. The vision is that the new tools of mode of action models, their combination in ITS, and the PoT-based emerging technologies allow the formulation of a Systems Toxicology approach. As discussed elsewhere (Hartung, 2010c), these integrated and information-rich assessments require a shift to a more probabilistic evaluation, where each and every test changes to some extent the probability of a hazard and/or its uncertainty.
The PoT approach represents the continuation of omics by reducing phenotypic characterization ("signatures") to the underlying PoT. This introduces a new quality: that of converting correlations into a hypothesis that can be tested or, in other words, validated. PoT can be manipulated (blocked, triggered), or PoT-specific assays can be designed.
We hypothesize that the number of PoT is finite. This corresponds with the idea that the number of vulnerable targets of a cell (its critical infrastructure) is finite. If this is the case, or at least if a limited number of PoT can cover a large number of agents and hazards, then a comprehensive list of PoT (the Human Toxome) (Hartung and McBride, 2011) will allow us to describe toxic effects at a new level of resolution. We will be able to annotate PoT to cell types, hazards, toxin classes, species, etc., in a manner similar to how we currently annotate (transcribed) genes. It is important to note that the Human Toxome will not be populated by a single test and a single measurement, however information-rich, but will require the confirmatory combination of various models and technologies. Pilot projects for endocrine disruptors, funded by NIH, and de-
and where often any significant response is taken as threshold, often rendering the systems overtly responsive. The problem of defining adversity (Boekelheide and Andersen, 2010; Boekelheide and Campion, 2010) can therefore be correlated with the thresholds of the prediction model of the alternative method they were identified in. Alternatively, methods trying to define the point of departure of biological responses are emerging (Judson et al., 2011). However, this is only a first step to finding acceptable methods to distill results from the rich datasets suitable to inform a risk assessment process. A prime example was given in 2010: the quick evaluation of dispersants used for the Gulf oil spill disaster shows that the new technologies can indeed deliver such information in a timely and cost-saving manner.

Need for probabilistic risk assessment
In order to make use of the novel high-content, high-throughput, and PoT information, we also need to develop ways of distilling relevant information out of the large datasets that will be produced. This requires a radical change from the past: traditional hazard identification methods have been descriptive or based on empirical studies, which are resource-intensive and inefficient (see above). Furthermore, empirical studies lack the capacity to detect low probability events, such as those experienced in low dose carcinogenicity. The current deterministic methods are based on point estimates, which are almost always worst-case estimates. In order to improve the transparency, consistency, and objectivity of the assessments, a need for more formal approaches to data integration has been recognized (OECD, 2009). Three main conceptual requirements for a multi-test decision framework, based on integration of multiple pieces of evidence and a decision-theoretic setting, have recently been formulated. According to the analysis, the framework must:
-be probabilistic, in order to quantify uncertainties and dependencies;
-be consistent by allowing reasoning in both causal and predictive directions;
-support a cyclic hypothesis and data-driven approach, where the hypotheses can be updated when new data arrive.
The formal framework that potentially meets these requirements, allowing for evidence maximization and reduction of uncertainty, can be found in Probabilistic Risk Assessment Networks (PRA). These PRA methods are designed specifically for prospective analysis of the likelihood of low probability events (Greenland, 1998). PRA tools are not new to the risk assessment process (Jager et al., 2001; Verdonck et al., 2005), and they have been used mainly in the derivation of exposure assessment scenarios.
The intent is to shift the emphasis of these tools to hazard identification and use PRA to analytically assess the probability that a substance could potentially cause harm. The advantage of PRA is that uncertainties are transparently taken into account, and the cautionary aspect is left to the risk management process. EPA ToxCast has started to develop a risk assessment framework based on high-throughput test systems (HTS) data (Judson
The critical question is whether there is a limited number of PoT. It is likely that the number of critical cellular infrastructures is limited, which means that the points of vulnerability, to which the PoT would converge, should also be limited.

Definition of PoT
There is no generally accepted definition of PoT. First, PoT are causal, in contrast to adaptive pathways. We might define, as an overarching term, Xenobiotic Response Pathways, which include PoT, pathways of defense (PoD), and epiphenomena (EpiP), which do not affect the manifestation of the altered phenotype. Note that EpiP can still serve as biomarkers if triggered consistently with the PoT, but blocking them would not alter the manifestation of toxicity. Three proposed definitions are:
PoT are molecularly defined chains of not necessarily linear cellular events stretching from point of chemical interaction to perturbation of metabolic networks and phenotypic change. PoT are causal (either necessary or aggravating) and will typically have a threshold of adversity.

Or
PoT are the formal description of toxic modes of action on the resolution of underlying biochemistry and molecular biology.

Or
PoT are causal links between a given toxicant and its effect in a systems toxicology approach.
These definitions distinguish PoT by molecular resolution from MoA and by causality from signatures/biomarkers. They leave open the interactions between different PoT (synergies, leading "pacemaker" PoT, etc.) and of PoT with PoD. Three very different approaches were taken to explore the concept: ToxCast of the US EPA (Judson et al., 2011; Kavlock and Dix, 2010) uses a broad variety of off-the-shelf pathway assays to characterize biological profiles of substances in an HTS manner and to associate these with their (mainly animal) toxic profile. The "Hamner approach" (Andersen et al., 2011) selected some known relevant pathways to explore the PoT concept. The approach spearheaded by CAAT (Hartung and McBride, 2011) aims for an unsupervised identification of PoT by omics technologies. The latter was just awarded an NIH Transformative Research grant, "Mapping the Human Toxome by Systems Toxicology," which aims to further define, annotate, and validate PoT as well as create a public database to share PoT from various groups and fields. The consortium includes both the Hamner Institutes for Health Sciences and ToxCast, thus raising the possibility of merging and synergizing the different approaches.
Formally developed alternative methods have one major advantage compared to the research models typically found in the literature: besides their higher degree of standardization and documentation, they need to include a prediction model, i.e., a formal algorithm for deriving predictive results. This means that the level of response indicating adversity is defined. This is rarely the case for tests that have not been formally evaluated,

Transition in regulatory toxicology
Developing the technologies, however, is only a first step. A possible transition to a new regulatory toxicology based on PoT represents an enormous and multi-faceted challenge (Hartung, 2009d)
There is a need for objective assessment, e.g., by evidence-based toxicology, to assess traditional and novel approaches.
-Making it a win/win/win situation: not every stakeholder will be happy with new approaches that are more complex and more circumspect with regard to the certainty of their results. We have to demonstrate the compensatory advantages of better predictivity.
et al., 2011) that has kinetic, mechanistic, and uncertainty components. Building on this approach, extending it to high-content (omics) data, and analytically combining the information within a PRA-based Bayesian network is the logical next step. Regulatory science is, for practical purposes, bound by the concept of classification and labeling to definitively assign a substance to hazard classes. Science, however, can only deliver probabilities (Hartung, 2010c). This is due to the nature of the underlying data: the biological objects we test are highly variable, and there are other uncertainties associated with diagnostic errors (Hoffmann and Hartung, 2005). This comforts neither the regulator nor the regulated players, as it impedes definitive hazard judgments and the resulting decisions. Tests change the pre-test to a post-test probability of hazard (Pepe, 2004), reducing uncertainty. This new understanding analytically refines the initial hazard information. A paradigm change like this will also allow new methods to enter the regulatory arena more easily, as these refined methods are not perceived as a "game-changing" full replacement, but as changers of probabilities. With the successful use of PRA in estimates and hazard judgments, its impact will grow and, we hope, eventually become central to hazard testing strategies, simultaneously reducing the costs and time associated with traditional approaches.
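As a hedged illustration of how a single test result shifts the pre-test probability of hazard to a post-test probability, the sketch below applies Bayes' rule to a test characterized by its sensitivity and specificity. The function names, the prior, and the test characteristics are illustrative assumptions and do not come from the text; chaining several tests this way additionally assumes that the tests are conditionally independent, which real test batteries may not satisfy.

```python
def post_test_probability(prior, sensitivity, specificity, positive):
    """Bayes' rule update of the probability that a substance is hazardous.

    prior        -- pre-test probability of hazard (0..1); an assumed input
    sensitivity  -- P(test positive | substance is hazardous)
    specificity  -- P(test negative | substance is not hazardous)
    positive     -- True if the observed test result was positive
    """
    if positive:
        p_result_given_hazard = sensitivity
        p_result_given_safe = 1.0 - specificity
    else:
        p_result_given_hazard = 1.0 - sensitivity
        p_result_given_safe = specificity
    numerator = p_result_given_hazard * prior
    evidence = numerator + p_result_given_safe * (1.0 - prior)
    return numerator / evidence


def update_with_tests(prior, results):
    """Chain several (sensitivity, specificity, positive) results.

    Assumes conditionally independent tests (a simplification).
    """
    probability = prior
    for sensitivity, specificity, positive in results:
        probability = post_test_probability(
            probability, sensitivity, specificity, positive
        )
    return probability


if __name__ == "__main__":
    # Hypothetical numbers: 10% prior, 80% sensitivity, 90% specificity.
    print(post_test_probability(0.10, 0.80, 0.90, positive=True))
    print(post_test_probability(0.10, 0.80, 0.90, positive=False))
```

In this sketch no single result decides the classification; each result only raises or lowers the probability of hazard, which is the sense in which tests act as "changers of probabilities".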
It will be necessary to combine the elements of high-information content methods (HIC), HTS, and ITS via PRA. The intent is to identify human hazards prospectively via efficient and effective analytical methods. The basic hypothesis of a PRA-HIC/HTS framework is that the approach provides useful information for current knowledge gaps and also better informs hazard decisions. PRA approaches, historically, have been based on traditional toxicological data (Chen et al., 2007). Here, we suggest using the data coming from HTS and HIC approaches. It is essential to develop a conceptual framework for integration of such test data coming from different sources to allow for integrated and reliable endpoint assessment, which we generally refer to as ITS. Such a decision-analytic framework will yield a more comprehensive basis upon which to guide decisions. A natural outgrowth of this approach is an increased capability to combine and reuse existing data. The integration of such probabilistic hazard information with probabilistic exposure information (van der Voet and Slob, 2007) and probabilistic dose response assessments by PBPK (Kodell et al., 2006) represents a logical extension of this approach. As a result, the goal must be to adapt HTS, HIC, and PRA to better inform hazard decisions of manufacturers and regulators.
rather than with a dose given to the animal, making it difficult to extrapolate the findings to an intact organism. One of the most obvious differences between the situation in vitro and in vivo is the absence of processes of absorption, distribution, metabolism, and excretion (i.e., biokinetics) that govern the exposure of the target tissue in the intact organism. In addition, metabolic activation and/or saturation of specific metabolic pathways or absorption and elimination mechanisms may also become relevant for the toxicity of a compound in vivo. These differences may lead to misinterpretation of in vitro data if such information is not taken into account. Therefore, predictive studies on biological activity of a compound require the integration of data on the mode of action with data on biokinetic behavior.
QIVIVE is the process of estimating the environmental exposures to a chemical that could produce target tissue exposures in humans equivalent to those associated with effects in an in vitro toxicity test (e.g., an EC50, a benchmark concentration, or an interaction threshold identified by a biologically based dose-response model for the toxicity pathway of concern). Using a combination of quantitative structure-property relationship (QSPR) modeling, physiologically based biokinetic (PBBK) modeling, and collection of in vitro data on metabolism, transport, binding, etc., QIVIVE can provide an estimate of the likelihood of harmful effects from expected environmental exposures.
Biokinetic modeling describes the dose and time-dependent absorption, distribution, metabolism, and elimination of a chemical within an organism. Biokinetic models can be divided into two general groups: data-based (classical) models and physiologically-based models (Andersen, 1991; Filser et al., 1995). Physiologically-based biokinetic (PBBK) models are especially useful for in vitro-to-in vivo, route-to-route, and animal-to-human extrapolations because they incorporate relevant anatomical structures that can be parameterized using independently derived parameters. In contrast to data-based models, PBBK modeling allows the description of the time-course of a compound's amount/concentration at the site of its action. PBBK modeling can contribute to reduction and refinement of animal studies by optimization of study design through identification of critical parameters and timeframes in kinetic behavior (Bouvier d'Yvoire et al., 2007; Clewell, 1993). In addition, PBBK models incorporating QSAR- and in vitro-derived parameters, coupled with in vitro assays of tissue/organ toxicity, have the potential to replace in vivo animal studies for quantitative assessment of the biological activity of xenobiotics (Blaauboer, 2001, 2002, 2003). The overall goal of this paper is to identify the key research needs to support a viable QIVIVE capability. The research proposed in this paper is considered to be fundamental to the successful use of in vitro kinetic data and PBBK modeling for
A recent expert panel review of the available science relevant to the 7th Amendment of the EU Cosmetics Directive's 2013 marketing ban (Adler et al., 2011) analyzed toxicokinetics, among other issues, and concluded that it would take more than five years for the development of methods for estimating in vivo kinetics necessary to support risk assessments based on in vitro assays for systemic toxicity.
The proposed roadmap identifies the key research needed to support quantitative in vitro-to-in vivo extrapolation (QIVIVE) for systemic toxicity for all chemicals. The common aim of this research is to foster the development of a methodology that incorporates state-of-the-art biokinetic modeling techniques to extrapolate critical concentrations at which in vitro toxicity is observed to be equivalent to in vivo doses based on the prediction of in vivo target tissue dosimetry. Kinetics should not be seen as a separate endpoint; rather, it is a tool to understand in vitro toxicity results and properly extrapolate them to human exposure. This methodology will provide a general framework for replacement of in vivo animal systemic toxicity assays with alternative in vitro toxicity testing.
The aim of classical toxicological risk assessment is to establish safety factors for human exposure based on the evaluation of the outcome of animal tests. The principal concern is finding the dose that causes no toxicologically relevant effect in the animal studies and extrapolating to the no-effect dose in the human under the application of appropriate safety factors. Most of the efforts to replace animal testing with alternative methods have focused on the use of in vitro tests for topical toxicity, such as skin and eye irritation (Hartung, 2010a). In contrast to their relatively straightforward application for topical toxicity, the use of in vitro toxicology methods as replacements for systemic toxicity testing faces significant challenges. In particular, these studies associate an effect with a concentration in medium
2 A Roadmap for the Development of Alternative (Non-Animal) Methods for Toxicokinetics Testing
chemical may suffice for many chemicals. Even with such a simple model, it would be possible to estimate the systemic concentrations expected to result from an in vivo exposure to a given dose. Thus, the model could be used to relate the concentrations at which toxicity is observed in an in vitro toxicity assay to the equivalent dose expected to be associated with toxicity for in vivo exposure. Similarly, biokinetic modeling of the in vitro toxicity assay can provide important information on the temporal profile of cellular exposure to a free chemical, which can be used in the design of the most appropriate in vitro experimental protocol (Teeguarden and Barton, 2004).
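To make the idea of such a simple model concrete, the following is a minimal sketch of a one-compartment description with constant (zero-order) input and first-order elimination. All function names, parameter names, and numbers are hypothetical, and the sketch assumes that the in vitro effect concentration can be equated directly with the systemic concentration, ignoring protein binding and differences between nominal and free concentrations.

```python
import math


def concentration(t_h, dose_rate_mg_per_h, clearance_l_per_h, volume_l):
    """Systemic concentration (mg/l) at time t_h (hours).

    One compartment, constant input, first-order elimination:
    C(t) = (dose_rate / CL) * (1 - exp(-(CL / V) * t))
    """
    k_el = clearance_l_per_h / volume_l  # elimination rate constant (1/h)
    c_ss = dose_rate_mg_per_h / clearance_l_per_h
    return c_ss * (1.0 - math.exp(-k_el * t_h))


def steady_state_concentration(dose_rate_mg_per_h, clearance_l_per_h):
    """Steady-state concentration (mg/l): input rate divided by clearance."""
    return dose_rate_mg_per_h / clearance_l_per_h


def equivalent_dose_rate(in_vitro_conc_mg_per_l, clearance_l_per_h):
    """Dose rate (mg/h) expected to give the in vitro effect concentration
    at steady state (reverse use of the same model)."""
    return in_vitro_conc_mg_per_l * clearance_l_per_h


if __name__ == "__main__":
    # Hypothetical chemical: clearance 5 l/h, distribution volume 20 l.
    for hours in (1, 4, 24):
        print(hours, concentration(hours, 10.0, 5.0, 20.0))
    print(equivalent_dose_rate(2.0, 5.0))
```

The same relationship is used in both directions: forward, from a dose to an expected systemic concentration, and in reverse, from an in vitro effect concentration to an equivalent dose.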
The greatest challenge in parameterizing even the simplest biokinetic models is the estimation of metabolic clearance. QSAR algorithms for predicting metabolism parameters have only been developed for a limited number of chemicals, primarily volatile organic compounds that are substrates for CYP2E1 (Peyret and Krishnan, 2011). Thus, it would be necessary to perform in vitro assays of the dose-response (capacity and affinity) for metabolic clearance (Houston and Carlile, 1997;Kedderis, 1997;Kedderis et al., 1993;Kedderis and Held, 1996). Eventually, as data accumulates for a large number of chemicals, it may become possible to predict clearance using QSAR approaches. Qualitative prediction of whether a drug is likely to be cleared by metabolism (including the CYP isoenzyme involved) or by urinary excretion on the basis of its physicochemical properties, has recently been demonstrated (Kusama et al., 2010). Of course, there is much more extensive data on drugs than on environmental chemicals.
There are chemicals, of course, for which a one-compartment description would not be expected to be adequate: highly lipophilic compounds, for example, or compounds for which
the extrapolation of in vitro toxicity data to in vivo. This research roadmap will specifically address uncertainties in the effect of biokinetics on the estimation of systemic toxicity (both acute and subchronic) of xenobiotics from in vitro assays. Figure 2.1 illustrates a conceptual structure for the use of biokinetic information in the estimation of in vivo toxicity from in vitro assays. In this scheme, available in vitro data on the absorption, tissue distribution, metabolism, and excretion of a chemical are used to parameterize a chemical-specific biokinetic model. In many cases, current quantitative structure-property relationship (QSPR) techniques can be used to estimate chemical properties and kinetics when the specific data for that chemical are lacking. For example, simple empirical correlations have been developed for estimating the tissue partitioning of a chemical from its water solubility, vapor pressure, and octanol/water partitioning (DeJongh et al., 1997; Paterson and Mackay, 1989; Poulin and Krishnan, 1995). In addition, emerging quantitative structure-activity relationship (QSAR) techniques (e.g., knowledge-based systems) and other in silico models will become increasingly useful for identifying likely metabolites and predicting potential target tissues for toxicity (Barratt, 2000), so that the appropriate assays of in vitro effects can be selected. These target tissue assays then can provide information on the nature and concentration-response of the toxic effects of the chemical.

Overview of QIVIVE
The complexity of the biokinetic model would depend on the physicochemical and biochemical characteristics of the chemical. A simple one-compartment description of the administered
[Figure caption fragment: Blaauboer et al., 2001]
quantitative assessment of the biological activity of xenobiotics (Blaauboer, 2001, 2002, 2003). Target tissues evaluated by in vitro assays can be included explicitly in the physiological structure of these models. The models can provide a mechanistic description of barrier functions (gut, bile, kidney, blood-brain barrier, skin, and placenta (if reproductive or developmental toxicity are under investigation)), so that the data obtained from transporter assays could be readily incorporated. Important research areas for in vitro methods include the development of validated, stable human hepatocyte systems, as well as in vitro systems for key transporters (renal, biliary, etc.). At the same time, QSAR applications need to be developed specifically to provide the kind of information needed by the PBBK models (metabolism constants, binding, etc.). Unfortunately, except in the case of drug-like compounds, the principal limitation in the development of useful QSAR applications appears to be the dearth of suitable data available for training knowledge-based systems. Nevertheless, in silico methods are of great interest, and some of them are under development or in the testing phase. Their importance will grow with the data fed into them, so that reliable and relevant in silico methods can be expected.
The utility of an approach that integrates cell-based assays with QIVIVE has been demonstrated in the case of acute neurotoxicity for eight chemicals: benzene, toluene, lindane, acrylamide, parathion/oxon, diazepam, caffeine, and phenytoin (Blaauboer, 2001). The aim of the study was the prediction of acute and subchronic neurotoxicity by integrating PBBK modeling with quantitative toxicity data obtained from non-animal studies. Specifically, the study evaluated the ability of in vitro neurotoxicity tests to predict the in vivo toxicity of the above chemicals, using PBBK models describing their biokinetic behavior to conduct QIVIVE. Model simulation of the target tissue dosimetry (i.e., the parent brain concentration) formed the basis for the prediction of the compound's systemic toxicity (Cronin et al., 2011) for different exposure scenarios (acute and subchronic). Subsequently, the neurotoxic concentrations estimated in in vitro tests (Kuegler et al., 2010; Crofton et al., 2011) could be compared with the brain concentrations simulated by the model. This approach allowed the authors to compare the toxic in vivo dose known from the literature with the model-predicted dose suspected to cause neurotoxicity. Overall, the results of this study showed that a reasonable prediction of the systemic toxicity could be made for six out of the eight investigated compounds. The discrepancy between the observed and estimated LOELs ranged from a factor of less than two for compounds with low toxicity, to a factor of ten for chemicals of high toxicity (Forsby and Blaauboer, 2007).
toxicity results from a metabolite. The physiological mammalian structure (tissue volumes, blood flows, ventilation rate, glomerular filtration rate, etc.), however, is well characterized (EPA, 1988; Brown et al., 1997), and there is no difficulty describing tissues separately when necessary. As mentioned above, techniques exist for estimating tissue-specific partitioning for many types of compounds. Other data required would also depend on the class of chemical. For volatile chemicals, ventilatory clearance can be estimated from the blood-air partition. For water-soluble chemicals, urinary clearance can be estimated from the glomerular filtration rate or the renal blood flow (for secreted compounds). For some classes of chemicals, it would also be necessary to determine the fractional binding of the chemical to plasma proteins or the partitioning of the chemical into red blood cells.
An important underpinning of this process is that the kind of information necessary for a chemical depends on its structure and physicochemical properties. It seems reasonable to expect that chemicals could be categorized into classes based on their properties, and that this categorization would simplify the process of determining the data needed for a particular compound. This concept is illustrated in Figure 2.2.
PBBK models incorporating QSAR- and in vitro-derived parameters, coupled with in vitro assays of tissue/organ toxicity, have the potential to replace in vivo animal studies for
[Figure caption fragment: Blaauboer et al., 2001]
In this figure, the key physicochemical properties of a compound include its volatility, water solubility, and lipophilicity. These properties can be thought of as dimensions in which compounds can be categorized. In this way, compounds with similar properties can be grouped, and data for similar compounds can be used to fill gaps in the knowledge of a particular compound. For example, a recent study evaluated the possibility of predicting the in vivo kinetics of volatile organic compounds (VOCs) using PBBK models derived solely on the basis of physiological data and QSPR modeling (Liao et al., 2007). The authors concluded that acceptable predictions could be made for inhalation of lipophilic VOCs, such as trichloroethylene, but that the necessary QSPR algorithms were not available for water-soluble VOCs such as acetone.

Example of a simple QIVIVE approach for parent chemical toxicity
High-throughput in vitro toxicity screening can provide efficient identification of the potential biological activity of chemicals. However, the concentrations at which effects are observed in the in vitro assays cannot be used to directly evaluate the safety of potential in vivo exposures without consideration of bioavailability and clearance of the chemicals (Blaauboer, 2010). Two recent studies evaluated the possibility of applying a simple QIVIVE approach to interpret the results of high-throughput assays conducted under the EPA ToxCast program (Rotroff et al., 2010; Wetmore et al., 2011). In these studies, hepatic metabolic clearance and plasma protein binding were experimentally measured for ToxCast Phase I chemicals. The experimental data were used to parameterize a simple in vitro-to-in vivo extrapolation model to estimate the human oral equivalent doses necessary to produce steady-state in vivo blood concentrations equivalent to in vitro AC50 (concentration at 50% of maximum activity) or LEC (lowest effective concentration) values in the in vitro ToxCast assays.
A simple clearance description (Wilkinson and Schenker, 1975) was used to estimate expected steady-state blood concentrations. The equation assumes zero-order uptake of a daily dose from the gut (assuming 100% oral bioavailability) with both renal and hepatic clearance. The steady-state concentration in the blood is then (see discussion in next section):
$$C_{ss} = \frac{k_o}{GFR \times F_{ub} + \dfrac{Q_l \times F_{ub} \times Cl_{int}}{Q_l + F_{ub} \times Cl_{int}}}$$
In this equation, the term GFR x Fub represents the renal excretion of unbound parent compound in blood by glomerular filtration, where GFR is the glomerular filtration rate, which is about 6.7 l/h in human adults (Rule et al., 2004), Fub is the fraction of the drug in the blood that is unbound (free), and ko is the input rate in mg/kg/h. The second term in the denominator is hepatic clearance, where Ql is liver blood flow (typically on the order of 90 l/h in adults) and Clint is the intrinsic metabolic clearance for first-order conditions of metabolism in the liver at low concentrations. Hepatocellular clearance in this study was experimentally determined at 1 μM, and the slope of the disappearance of the chemical over time was determined. Clearance was normalized to cell number, with the units μl/min/10^6 cells. In vivo intrinsic clearance was estimated by simply multiplying the in vitro clearance by the number of cells per gram of liver (roughly 137 x 10^6) and the weight of the liver (about 1820 g in an adult). Css calculations were performed using an arbitrary dose of 1 mg/kg/day. The Simcyp simulation platform (Rostami-Hodjegan and Tucker, 2007) was used to perform Monte Carlo analysis to simulate variability across a population of 100 healthy individuals of both sexes from 20-50 years of age. A coefficient of variation of 30% was used for intrinsic and renal clearance. Reverse dosimetry was then used to generate oral equivalent doses according to the following formula:
Oral Equivalent Dose (mg/kg/day) = (AC50 or LEC) / Css, with Css being the steady-state concentration calculated for the dose of 1 mg/kg/day
For a small number of these chemicals, it was possible to find in vivo biokinetic data to estimate a steady state concentration at an exposure of 1 mg/kg/day for comparison with the in vitro predictions. The results of the comparison are shown in Table 2.1.
Two assumptions (Wilkinson and Schenker, 1975) were employed: (1) restrictive hepatic clearance (assuming only unbound chemical is available for clearance), using Fub determined experimentally; and (2) non-restrictive hepatic clearance (assuming all of the chemical is available for clearance), where the Fub was set to one. Triclosan is an example of a chemical that appears to have restrictive clearance, while picloram appears to have non-restrictive clearance, and the behavior of lindane appears to be intermediate between the two extremes. In general, the assumption of restrictive clearance produces a more conservative (higher) estimate of Css. In fact, these two clearance assumptions represent extremes bracketing the possible relationship between chemical disposition/transport and hepatocellular metabolism that can result in Css estimates that differ by several orders of magnitude. Hepatic clearance is determined by a complex interplay of factors, including liver blood flow, the association and dissociation rates for binding of the chemical to plasma proteins such as albumin, the kinetics of hepatocellular uptake of the chemical, and the kinetics of hepatocellular metabolism. Indeed, no approaches have yet been demonstrated to predict the fraction of compound available for metabolism, even in the case of drugs.
The assumption of 100% oral bioavailability is conservative from a human health standpoint because lower absorption results in a higher oral dose required for achieving a specific Css; however, incorporation of Caco-2 assay data on bioavailability into the QIVIVE model can increase the predictivity of the Css
For comparison purposes, two alternative hepatic clearance as- one of the advantages of using in vitro metabolism data over in vivo experiments.
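The steady-state calculation and reverse dosimetry described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the function names are hypothetical, the 70 kg body weight used to convert mg/kg/day into mg/h is an assumption not given in the text, the non-restrictive case sets Fub to one in the hepatic term only, and the in vitro effect concentration is assumed to be supplied in the same units as Css (mg/l). GFR, liver blood flow, hepatocellularity, and liver weight are the approximate values quoted above.

```python
# Approximate adult values quoted in the text.
GFR_L_PER_H = 6.7                    # glomerular filtration rate
LIVER_BLOOD_FLOW_L_PER_H = 90.0      # Ql
CELLS_PER_G_LIVER_MILLIONS = 137.0   # roughly 137 x 10^6 cells per gram of liver
LIVER_WEIGHT_G = 1820.0
BODY_WEIGHT_KG = 70.0                # ASSUMPTION: not stated in the text


def scale_clint_to_in_vivo(clint_ul_per_min_per_million_cells):
    """Scale hepatocyte clearance (ul/min/10^6 cells) to whole-liver Clint in l/h."""
    ul_per_min = (clint_ul_per_min_per_million_cells
                  * CELLS_PER_G_LIVER_MILLIONS * LIVER_WEIGHT_G)
    return ul_per_min * 60.0 / 1.0e6  # ul/min -> l/h


def steady_state_concentration(dose_mg_per_kg_day, fub, clint_l_per_h,
                               restrictive=True):
    """Css (mg/l) assuming 100% oral bioavailability, renal plus hepatic clearance."""
    ko_mg_per_h = dose_mg_per_kg_day * BODY_WEIGHT_KG / 24.0
    renal = GFR_L_PER_H * fub
    # ASSUMPTION: non-restrictive clearance sets Fub to 1 in the hepatic term only.
    f_hep = fub if restrictive else 1.0
    hepatic = (LIVER_BLOOD_FLOW_L_PER_H * f_hep * clint_l_per_h
               / (LIVER_BLOOD_FLOW_L_PER_H + f_hep * clint_l_per_h))
    return ko_mg_per_h / (renal + hepatic)


def oral_equivalent_dose(effect_conc_mg_per_l, css_at_unit_dose_mg_per_l):
    """Reverse dosimetry: dose (mg/kg/day) giving a Css equal to the AC50 or LEC."""
    return effect_conc_mg_per_l / css_at_unit_dose_mg_per_l


if __name__ == "__main__":
    # Hypothetical chemical: 10 ul/min/10^6 cells, 10% unbound in blood.
    clint = scale_clint_to_in_vivo(10.0)
    for restrictive in (True, False):
        css = steady_state_concentration(1.0, 0.1, clint, restrictive)
        print(restrictive, css, oral_equivalent_dose(1.0, css))
```

Running the two clearance assumptions side by side for the same chemical shows how they bracket the Css estimate, with the restrictive case giving the higher value.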

Research gaps
The subsequent sections of this paper will attempt to elucidate the key research areas needed to support QIVIVE for assessing risks on the basis of in vitro toxicity data, including:
- Improving the accuracy of in vitro toxicity assays by determining the free concentration of chemical instead of simply using the nominal concentration
- Extrapolating in vitro kinetic results to estimate in vivo clearance
- Obtaining parameters for PBBK models to perform QIVIVE
The proposed key research areas are summarized in Figure 2.3 and Table 2.2.

Characterization of free concentration
The free concentration of a chemical drives both its kinetics and dynamics (Mendel, 1992). The concentration of free chemical in an in vitro assay that elicits a certain response may differ determination, as in the case of oxytetracycline dihydrate and picloram in Table 2.1. On the other hand, the assumption that renal clearance is solely a function of Fub and the GFR is not necessarily conservative, since active renal resorption would result in a higher Css at a given dose, as in the case of the two perfluorinated chemicals in Table 2.1.
Other limitations of this simple approach include:
- The analysis is predicated on the assumption that blood concentrations equivalent to the nominal in vitro AC50 or LEC values would produce equivalent responses in vivo. However, the concentration of free chemical in an in vitro assay that elicits a certain response may differ from the nominal AC50 value due to factors such as protein-lipid composition of the media and binding of the chemical to surfaces (Blaauboer, 2010).
- The biokinetics and bioactivity were only evaluated for the parent compound. No attempt was made to evaluate biological activities and dosimetry of metabolites.

Example of a QIVIVE approach for toxicity of a metabolite
A couple of publications by Punt and colleagues (Punt et al., 2008, 2009) present an example of a more sophisticated QIVIVE approach using metabolism data collected in a number of subcellular fractions. Although the intent of the study was to evaluate the relevance of carcinogenicity of estragole reported in high-dose animal studies to human exposure situations, a similar QIVIVE approach also could be applied for the interpretation of in vitro toxicity assays. The key metabolism parameters to be estimated were rates of multiple biotransformation reactions that determine the level of carcinogenic species (1-sulfooxyestragole) in the liver. Due to the complexity of metabolism steps involved in formation of the ultimate carcinogenic metabolite of estragole as well as detoxication of parent compound and other intermediate metabolites, the approach using a combination of subcellular fractions along with different cofactors was more valuable for the purpose of their modeling than using a more integrated system such as hepatocytes (Punt et al., 2008, 2009). By manipulating cofactors such as NADPH, UDPGA, NAD+, and PAPS in the selected in vitro system of microsomes or S9, multiple steps of estragole metabolism mediated by CYPs, UGTs, dehydrogenases, and SULTs, respectively, could be characterized. The rates of those reactions were used to describe the critical metabolism pathways in estragole bioactivation and detoxication. Those reactions were described well with Michaelis-Menten kinetics and the resulting Vmax and Km parameters were scaled to in vivo based on the microsomal or S9 protein content.
The interplay of these multiple reactions was integrated in the PBBK model and the simulated concentrations of two estragole metabolites in the rat and human urine were reasonably consistent with the observed in vivo data considering the purpose of the modeling was to evaluate the dose-dependent changes in bioactivation, not to predict the absolute dose metrics (Anthony et al., 1987; Punt et al., 2008, 2009).
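The scaling step described above, in which Vmax values measured in subcellular fractions are scaled to the whole liver and competing Michaelis-Menten pathways are compared, can be illustrated with a short Python sketch. All names and numbers here are hypothetical placeholders chosen for illustration; they are not the estragole parameters reported by Punt and colleagues, and the protein yield per gram of liver must be taken from the literature for the fraction actually used.

```python
def scale_vmax_to_liver(vmax_nmol_per_min_per_mg_protein,
                        mg_protein_per_g_liver, liver_weight_g):
    """Scale an in vitro Vmax (nmol/min/mg protein) to whole liver (umol/h)."""
    nmol_per_min = (vmax_nmol_per_min_per_mg_protein
                    * mg_protein_per_g_liver * liver_weight_g)
    return nmol_per_min * 60.0 / 1000.0  # nmol/min -> umol/h


def michaelis_menten_rate(conc_um, vmax, km_um):
    """Reaction rate at a given substrate concentration (same units as vmax)."""
    return vmax * conc_um / (km_um + conc_um)


def bioactivation_fraction(conc_um, activation, detoxication):
    """Share of total metabolism going through the activating pathway.

    `activation` and `detoxication` are (vmax, km) tuples in matching units.
    """
    rate_act = michaelis_menten_rate(conc_um, *activation)
    rate_det = michaelis_menten_rate(conc_um, *detoxication)
    return rate_act / (rate_act + rate_det)


if __name__ == "__main__":
    # PLACEHOLDER values: a low-affinity activating pathway competing with a
    # high-affinity detoxication pathway that saturates at high concentration.
    activation = (100.0, 500.0)
    detoxication = (100.0, 5.0)
    for conc in (1.0, 10.0, 100.0, 1000.0):
        print(conc, bioactivation_fraction(conc, activation, detoxication))
```

With these placeholder values the activating share grows as the detoxication pathway saturates, which is the kind of dose-dependent shift in bioactivation that the modeling was intended to evaluate.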

In vitro estimation of intestinal absorption and metabolism
To accurately predict the systemically available dose of the chemical, it is important to consider potential metabolism at the portals of entry in addition to the hepatic metabolism. Despite its importance as a modifier of oral bioavailability, intestinal metabolism has received less attention than other extrahepatic metabolism. The mucosal epithelium of the gastrointestinal (GI) tract contains substantial amounts and types of xenobiotic-metabolizing enzymes, among which CYP3A enzymes have been the focus of a great deal of research in pharmaceuticals due to their role in causing reduced oral bioavailability and as a major source of inter-individual variability resulting from variable constitutive expression of gut CYPs and potential drug-drug interactions (Paine et al., 1997). From the risk assessment point of view, other phase I and II enzymes in the GI tract should also be carefully considered in IVIVE. In addition to the liver, the GI tract is a key site for hydrolysis of a number of ester compounds of environmental concern that are used in pesticides and consumer products, including pyrethroids, phthalates, and parabens (Kluwe, 1982;Crow et al., 2007;Imai, 2006). The significance of intestinal phase II metabolism to total chemical clearance is another factor to be considered in IVIVE of metabolism. Intestinal glucuronidation of BPA demonstrates the importance of consideration of intestinal metabolism to provide key information for describing BPA biokinetics for human health risk assessment based on in vitro metabolism information (Mazur et al., 2010). Compared to IVIVE of hepatic metabolism data, there are several challenges in extrapolating in vitro intestinal metabolism parameters to in vivo. First, the intestine is not a homogenous organ and therefore spatial differences are evident in distribution of metabolizing enzymes within the mucosa as well as along the length of the intestine (van de Kerkhof et al., 2007). 
This factor makes it difficult to interpret and extrapolate in vitro metabolism parameters obtained from intestinal tissue-driven in vitro systems such as microsomes and S9 fractions (van de Kerkhof et al., 2007). Intestinal cell lines such as Caco-2 (Karleta et al., 2010) have been used to determine absorption parameters in vitro (Sambuy et al., 2005), but use of these cell lines as a surrogate for metabolism in the GI tract is problematic due to differences in enzyme expression compared to human intestinal tissue (Imai et al., 2005;van de Kerkhof et al., 2007). Another complication comes from the fact that intestinal metabolism often is greatly influenced by chemical flux into the enterocytes, i.e., intestinal metabolism is closely related to the uptake/absorption process, making it difficult in terms of both measurement and interpretation of the results (Paine et al., 1997;Yang et al., 2007). More studies are warranted to develop better in vitro tools to predict intestinal metabolism, and then better extrapolation strategies can be developed based upon the relevant in vitro metabolism data for coherent extrapolation considering the interplay with chemical absorption processes in the intestine.

In vitro determination of dermal exposure
For environmental and cosmetic chemicals, the dermal route of exposure is highly likely. Therefore, in vitro assays should be fur-from the nominal concentration (added amount of chemical divided by volume of the medium) due to factors such as protein/ lipid binding in the medium Seibert et al., 2002), evaporation, precipitation, and adherence of the chemical to surfaces (Blaauboer, 2010). To determine the in vivo plasma concentration expected to elicit a target-tissue response similar to the cellular response in the in vitro assay, the free fraction must be determined in both the in vitro and in vivo exposures (Gulden et al., 2006;Gulden and Seibert, 2003;Teeguarden and Barton, 2004). To the extent that the cells in the in vitro assay are representative of the cells in the in vivo target tissue, equal free concentration in the medium and plasma will be associated with the same intracellular exposures (Gulden et al., 2001).
Protein binding can be a key determinant of disposition (Gulden and Seibert, 1997), affecting compound availability for uptake into cells in vitro as well as target tissues in vivo. For example, the use of whole serum or serum albumin in cell-based assays can greatly alter the apparent dose-response for cellular toxicity compared to serum-free media (Hestermann et al., 2000;Brunner et al., 2010). A high fraction bound also gives rise to concerns regarding potential competitive binding by other compounds that could modulate the free concentration (Teeguarden and Barton, 2004). Methodologies to estimate protein binding and approaches for the description of the kinetics of binding in biokinetic models have been areas of intense interest over the past four to five decades. Consideration of protein binding faces two parallel challenges: first, when compounds are bound in media or capillary blood, what fraction should be regarded as available for transport into cells or tissue, and, second, how does the binding influence medium/cell or blood/tissue partitioning.
In general, medium, cells, blood, and tissues all will contain free and bound forms of the compound. For equilibration, only the free compound diffuses across the medium/cell or plasma/tissue interface, and at equilibrium the free concentration on both sides of the interface is expected to be equal (except in the case of active transport). However, the equilibrium relationship of the concentration in cells or tissues compared to the medium or plasma is typically described with empirical partition coefficients based on measurements of total concentrations of the compound. Differential binding, therefore, will influence apparent partitioning. However, there are quite a number of different determinants of apparent partitioning, complicating the interpretation of such data:
- Partitioning due to lipophilicity
- Plasma binding
- Tissue binding
- Active transport
- Clearance processes
- Blood:plasma ratio
The blood:plasma ratio is needed for converting tissue:plasma partitions to tissue:blood, or fraction unbound in the plasma to fraction unbound in the blood (Yang et al., 2010).
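The relationships described above can be written out as a brief Python sketch. It assumes that the free concentration is simply the total concentration multiplied by a measured fraction unbound, and that equal free concentrations in medium and plasma correspond to equal cellular exposure (no active transport). The function names are hypothetical, and losses such as evaporation, precipitation, or adherence to surfaces are not modeled.

```python
def free_concentration(total_conc, fraction_unbound):
    """Free concentration from a total concentration and fraction unbound."""
    return total_conc * fraction_unbound


def equivalent_total_plasma_conc(nominal_medium_conc, fu_medium, fu_plasma):
    """Total plasma concentration whose free level matches the in vitro medium."""
    return free_concentration(nominal_medium_conc, fu_medium) / fu_plasma


def fraction_unbound_in_blood(fu_plasma, blood_plasma_ratio):
    """Convert fraction unbound in plasma to fraction unbound in blood.

    Follows from equal free concentrations: fu_blood * C_blood = fu_plasma * C_plasma.
    """
    return fu_plasma / blood_plasma_ratio


def tissue_blood_partition(tissue_plasma_partition, blood_plasma_ratio):
    """Convert a tissue:plasma partition coefficient to tissue:blood."""
    return tissue_plasma_partition / blood_plasma_ratio


if __name__ == "__main__":
    # Hypothetical compound: 50% free in serum-containing medium, 10% free in plasma.
    print(equivalent_total_plasma_conc(10.0, fu_medium=0.5, fu_plasma=0.1))
    print(fraction_unbound_in_blood(0.1, blood_plasma_ratio=0.8))
    print(tissue_blood_partition(4.0, blood_plasma_ratio=0.8))
```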
Furthermore, the application of analytical techniques is considered a prerequisite for proper QIVIVE. The workshop participants agreed that the use of such techniques (as opposed to reliance on nominal concentrations) is critical and that their importance is not sufficiently appreciated.
ing pesticides and endocrine active compounds, they have not been well studied compared to CYPs and other phase I and II enzymes. To describe the role of esterases on detoxication of the chemical, it is necessary to include extrahepatic metabolism, most representatively the metabolism in blood due to the presence of carboxylesterases and other types of esterases. Metabolism in the GI tract and skin should also be characterized to estimate esterase-mediated detoxication capacity in the body (Prusakiewicz et al., 2006).

Identification of the key metabolism pathways and toxic moieties
To be performed correctly, QIVIVE requires information on what the active entity would be in the target tissue based on the potential mechanisms of toxicity. Predicting primary metabolic pathways, along with the potential for producing active metabolites, could be supported by in silico approaches such as QSAR (Kusama et al., 2010). Knowledge built on drug data showing the role of chemical properties in metabolism, binding, and partition would help this categorization. To determine the extent and design of in vitro metabolism assays aided by such tools, the criteria for this classification should be based on major pathways of metabolism, since that is the key information needed in designing in vitro metabolism studies for IVIVE.

Organotypic models of in vivo hepatic function
Ensuring realistic metabolism, both qualitatively (the types of metabolites formed) and quantitatively (the relative amounts of these various metabolites) is one of the most difficult challenges in QIVIVE. One possible direction for meeting this challenge is the development of organotypic hepatic systems (bioreactors) that appropriately reflect the complexity of in vivo hepatic function. In principle, these systems could be used to provide data for in silico modeling of both kinetics and dynamics (Sung et al., 2010). However, a number of difficulties will need to be overcome: (1) developing screening analytical chemistry methods that would allow rapid evaluation of metabolites produced and excreted from the cells or cell aggregates in culture, (2) development of stable organotypic liver cultures that recapitulate in vivo metabolism for sequential or parallel metabolic networks, and (3) ensuring metabolic competencies, both metabolite production and parent and metabolite loss, from the tissue culture system by metabolism or routes of non-specific loss, such as renal excretion.
These organotypic hepatic cell cultures could be used to rapidly assess metabolism and confirm QSAR predictions of likely metabolites. Metabolites that were identified in some significant yield might themselves be studied in the in vitro test systems. In the past, analytical methods development was tedious and time-consuming. It is possible, however, that this process could be accelerated with modern methods of higher throughput analytical chemistry. The ultimate goal would be the development of tissue cultures or hepatic bioreactors (Seagle et al., 2008) that include recirculation and medium replenishment over time to mimic an in vivo situation.
The ability to assess metabolism by examining effluent compounds from the culture systems could be coupled with other ther developed to predict the rate of dermal penetration and metabolism in the skin. The challenge in predicting accurate dermal uptake and metabolism is similar to that for intestinal absorption, in that absorption and metabolism are competing processes. Human skin contains both CYP enzymes (Storm et al., 1990) and esterases (Prusakiewicz et al., 2006), which can be of importance for presystemic clearance of a compound as well as for generation of toxic metabolites if the skin is a target tissue.

In vitro estimation of metabolism
The success of IVIVE is largely dependent on the quality and relevance of in vitro metabolism data (Coecke et al., 2006). There have been significant improvements in the quality of human tissue preparation in recent years, as well as parallel advances in application strategies of those in vitro data to predict in vivo kinetics (Chiba et al., 2009; Gomez-Lechon et al., 2007; Houston and Galetin, 2008). These advances have made it possible to implement QIVIVE for PBBK models during drug development (De Buck and Mackie, 2007; Pelkonen and Turpeinen, 2007; Rostami-Hodjegan and Tucker, 2007). For pharmaceutical compounds, however, the screening of new chemical entities involves evaluation of whether the candidate possesses drug-like properties, including relatively moderate metabolism and inactive metabolites. Thus, IVIVE for drug metabolism has focused largely on metabolic stability screening to inform the drug's half-life and oral bioavailability using the clearance model (Pelkonen and Turpeinen, 2007; Houston and Galetin, 2008). For this type of IVIVE, linking the total intrinsic clearance in vitro in conjunction with the unbound fraction in blood and the liver blood flow to predict in vivo clearance has been the most common practice (Fagerholm, 2007; Houston and Galetin, 2008). Although the experience built upon drug data can be applied to the IVIVE approach for chemicals, the challenges in QIVIVE for chemicals are different from those for pharmaceuticals, primarily due to the wider range of chemical properties compared to drugs. There is also a greater need to consider the role of metabolism in determining chemical toxicity. For chemicals, IVIVE should preferably be conducted at the level of an individual enzyme/metabolic pathway primarily responsible for formation of the active species or depletion of the active parent compound instead of measuring total intrinsic clearance of the parent chemical.
An apparent limitation of applying total clearance-based IVIVE to chemicals is its difficulty in describing the formation and clearance of toxic metabolite(s). Another limitation arises from dealing with the broader range of exposure concentrations and routes for chemicals compared to a narrower/targeted concentration range and oral route for drug candidates. IVIVE issues will also vary depending on the kinds of enzymes involved in chemical metabolism. Both chemical properties and knowledge of mechanism of action inform which metabolic pathways would be primarily responsible for chemical metabolism. This information can serve as criteria for categorizing chemicals into subgroups for different strategies based on primary metabolic enzymes.
Despite the fact that esterases are known to play an important role in metabolizing many environmental chemicals, includ-

Possible strategy to determine metabolites
Techniques to estimate the concentration of a substance at the site of action include both direct and indirect ones: biomarkers, microdialysis, imaging, mass spectrometry, and simulations by modeling (Pelkonen et al., 2008). Special emphasis should be placed on the use of advanced bioreactors (Darnell et al., 2011), including relevant cell systems, e.g., HepaRG cells, to mimic the appropriate metabolism combined with analytical methods. Mass spectrometry, in particular, has proven to be an optimal tool to determine metabolites. As Pelkonen and co-workers state (Pelkonen et al., 2009) "…in silico or in vitro, in conjunction with animal data, provide useful and necessary information, on which to base the first PK studies in humans. The prerequisite is to use appropriate and up-to-date techniques and biological preparations." The final goal would be to build a virtual human to model the whole process a compound undergoes in the human body to enhance drug development and improve risk assessment. A starting concept, taking into account available techniques, can be found in Figure 2.4.

In vitro estimation of renal clearance
The state of the art for in vitro models of renal clearance is not as advanced as in the case of liver clearance, although some progress has been made in the case of drugs (Kusuhara and Sugiyama, 2009). The relative spatial complexity of renal tubular transport systems compared to the more homogenous hepatic metabolic analyses to evaluate fidelity between the in vivo and in vitro pathways. A well-designed liver bioreactor could function in a fashion similar to isolated-perfused liver preparations (Bessems et al., 2006). Analysis of metabolites produced in a bioreactor might also serve to benchmark expected metabolic pathways. Evaluation of the fidelity of the bioreactor and new organotypic systems could be verified by assessing metabolite profiles with specific test compounds, i.e., using compounds whose metabolism has already been well-studied in vivo.
It may be necessary to develop co-culture systems or microfluidic systems that maintain metabolism, recirculation, continuous addition of test compound, and ongoing loss from the culture system. The microfluidic, body-on-a-chip design (Maguire et al., 2009) has potential for creating custom in vitro toxicity evaluations for multiple cells plated onto different parts of the microfluidic plate. This system requires more development, especially to move from a laboratory research device to low to medium throughput. The system was designed based on PBPK model structures developed by Shuler and colleagues (Esch et al., 2011). Another possibility might be to have a relatively large hepatic bioreactor and to divert flow to multiple chambers with various cell types for in vitro testing. The cells would have continuous flow of the bioreactor fluid, and the effluent from the culture plates could be collected and re-circulated to the bioreactor. While these designs are not yet available, they are technically within reach.

Fig. 2.4: Proposed strategy to assess metabolite effects in in vitro studies (adapted from Pelkonen et al., 2009)

the researcher and analysts, so they will be able to augment each other. As recently explained by Jaworska and Hoffmann (2010) via the concept of Bayesian networks, the structure of the testing strategy matters and will influence the risk assessment process. Complex networks will not lose rare but important events or small but multiple perturbations in key nodes (Jaworska and Hoffmann, 2010).

Conclusions and recommendations: toxicokinetics
The main objective of this chapter was to determine the research needs for developing a methodology to incorporate in vitro kinetic data into in vivo biokinetic models to support risk assessments based on cellular toxicity assays. The proposed methodology starts with the identification of the critical aspects of the metabolism of a compound for the intended purpose of the risk assessment. This preliminary information includes a combination of qualitative metabolism studies and selected in vitro toxicity assays to identify the active species and primary metabolic pathways responsible for producing and detoxifying the toxic entity. Current examples of IVIVE often rely, in part, on existing in vivo data. As experience with IVIVE accumulates, however, it will become increasingly possible for such information to be gained from in silico-based prediction tools and targeted in vitro kinetic studies, particularly using organotypic in vitro systems that better mimic in vivo conditions.
The current state of the art presents an excellent opportunity for development of improved in vitro ADME methodologies. The technologies necessary to support these initiatives are now coming to maturity, and the need for rapid toxicity testing of both drugs and commercial chemicals is becoming more acute. Recent advances in stem cell biology may allow the development of custom bioreactors with more relevant cellular components and allow the bioreactor to serve as both a metabolite generator and a test system for the toxicity and biological responses of molecules.

Recommendations: toxicokinetics
General but indispensable:
1. For the extrapolation of an in vitro assay to in vivo, the measurement of the free chemical concentration is absolutely necessary.
2. In vitro biokinetics should be taken into consideration to improve the quality of in vitro toxicity data.
3. The use of kinetic parameters to correlate in vitro effective concentrations to a dose is absolutely essential.
4. Quality training data for the wide range of chemical property classes should be made available, particularly for "non-druglike" compounds.
5. Analytical methods and computational modeling should be taken into account and employed wherever possible.
architecture greatly increases the difficulty of developing representative in vitro model systems. However, it should at least be possible to develop assays to identify whether a compound is a substrate for a particular transporter (Yang et al., 2009). This would provide an indication of the likelihood that a compound's renal clearance might deviate from expectations based on glomerular filtration. A similar capability could be developed for assessing biliary clearance.

PBBK model development
The parameters in a PBPK model can be categorized into four types: exposure, physiological, partitioning, and metabolism. The exposure parameters are determined solely by the characteristics of the exposures and the physiological parameters are available from the literature (Brown et al., 1997). These types of parameters are not chemical-specific, and the values used in the evaluation of an untested compound would be the same as those used for well-characterized compounds.
Partitioning and kinetic parameters, however, are chemical-specific and need to be estimated for untested compounds. A number of software platforms are available to support generic PBPK modeling for pharmaceuticals using in vitro metabolism data, as exemplified by the Simcyp platform (Jamei et al., 2009; Rostami-Hodjegan and Tucker, 2007). Because these generic platforms are designed to support modeling of drug compounds, their focus is on oral and intravenous exposures, and on metabolism by oxidative (CYP) and conjugative (UGT) enzymes. Effective use of these software platforms for PBPK modeling of environmental and personal care compounds would require enhancements in two areas: (1) addition of descriptions of dermal and inhalation exposure, and (2) addition of data on esterase metabolism enzymes. In the field of environmental risk assessment, PBBK models typically have been developed for individual chemicals. Although generic modeling platforms are available for some classes of compounds, e.g., MEGen (Loizou and Hogg, 2011), the development of generic models has not been as extensive as in the pharmaceutical area. A useful generic modeling platform would include the following features:
- user-friendly, open access
- database for physiological parameters
- inhalation, dermal, and oral exposure routes
- capability for multiple parallel metabolic pathways
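The four parameter types and the split between generic and chemical-specific values can be made concrete with a small data-structure sketch in Python. This is only an illustration of how a generic platform might organize its inputs. Every class, field, and value below is hypothetical and does not describe Simcyp, MEGen, or any other existing tool.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExposureParams:            # determined by the exposure scenario
    route: str                   # e.g. "oral", "dermal", "inhalation"
    dose_mg_per_kg_day: float


@dataclass(frozen=True)
class PhysiologicalParams:       # taken from the literature, not chemical-specific
    body_weight_kg: float
    liver_blood_flow_l_per_h: float
    gfr_l_per_h: float


@dataclass(frozen=True)
class PartitioningParams:        # chemical-specific
    tissue_blood: Dict[str, float]


@dataclass(frozen=True)
class MetabolismParams:          # chemical-specific; several parallel pathways
    pathways: Dict[str, Tuple[float, float]]   # name -> (Vmax, Km)


@dataclass(frozen=True)
class PBBKParameters:
    exposure: ExposureParams
    physiology: PhysiologicalParams
    partitioning: PartitioningParams
    metabolism: MetabolismParams


def for_untested_compound(template, partitioning, metabolism):
    """Reuse exposure and physiology; swap in the chemical-specific estimates."""
    return replace(template, partitioning=partitioning, metabolism=metabolism)


def example_template():
    """Placeholder parameter set for a well-characterized compound."""
    return PBBKParameters(
        ExposureParams("oral", 1.0),
        PhysiologicalParams(70.0, 90.0, 6.7),
        PartitioningParams({"liver": 1.5, "fat": 20.0}),
        MetabolismParams({"oxidation": (10.0, 50.0)}),
    )
```

A new compound would then be set up by calling `for_untested_compound` with partitioning and metabolism values estimated in silico or in vitro, leaving the exposure and physiological parameters unchanged.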

Integrated testing strategies (ITS)
Toxicokinetics and the methods mentioned already should not be understood as stand-alone methods or endpoints. Kinetics is a tool to understand and modify any in vitro result and should be incorporated into testing strategies as a requirement for any extrapolation to in vivo. In general, an integrated testing strategy should consist of information about the physicochemical properties of a substance, the structure activity relationships (QSARs), in vitro data, and kinetic and dynamic modeling. All these factors combined should then lead to an evaluation against in vivo data (Dejongh et al., 1999). Experimental research, computational methods, and integrated testing strategies should be developed in an interactive way between

Further research areas:
6. Improved in vitro models are needed, particularly in the areas of intestinal and dermal absorption and the associated presystemic metabolism. Hepatic, renal, and respiratory clearance also are of special interest. Organotypic culture kinetics and metabolite identification should be investigated. Barriers should be taken into account by appropriate in vitro assays.
7. The development of generic PBPK modeling platforms should be furthered. They should be user-friendly and have open access, with a database for physiological parameters. They should be able to simulate inhalation, dermal, and oral exposure, allowing description of multiple parallel metabolic pathways.
8. Standard methods for the characterization of the free concentration in cell-based assays should be developed, including the features of binding, metabolism, and active transport into the cell.
9. In the area of data collection and in silico approaches, metabolite identification and protein binding in cell-based assays should be addressed. Furthermore, restricted versus unrestricted hepatic clearance should be investigated. Data should be collected for hepatic and renal clearance, metabolism rates, gut absorption, and metabolism, especially for non-drug-like compounds. Also, identification of transporter substrates is a potential area of in silico modeling.
10. QIVIVE case studies, with special emphasis on those that did not work, should be performed for different classes of physicochemical properties, different metabolism pathways, toxicity from parent versus stable metabolite versus reactive metabolite, and portal of entry versus liver versus remote toxicity.

Special research areas:
11. The very promising areas of in vitro bioreactors and the microfluidic human-on-a-chip should be further developed and standardized.
12. High-Throughput Toxicity Screens combined with kinetics data should be further investigated.
13. An equivalent to the Lipinski rules for drugs should be developed for chemicals.

record of past and present activity in this area (Adler et al., 2011; Aeby et al., 2010; Basketter and Maxwell, 2007; Basketter, 2008; Basketter and Kimber, 2011; Bauch et al., 2011; De Silva et al., 1995; dos Santos et al., 2009; Hartung et al., 2011; Kimber et al., 1999b, 2001, 2010, 2011; Martin et al., 2010; Maxwell and Mackay, 2008; Patlewicz et al., 2007; Reuter et al., 2011; Ryan et al., 2005; Vandebriel and van Loveren, 2010). However, our intention here is not to provide yet another scholarly review article. It is rather to provide a personal perspective with regard to the development of alternative methods for skin sensitization testing, how the current landscape appears to us, and what we believe are the requirements for (and distractions from) achieving real success in addressing this challenge. This is not, therefore, a consensus document.
One important aim is to be provocative, to excite argument and discussion. If what follows is at odds with the views of others, we apologize in advance for any perceived criticism and ask only that the article be viewed as a stimulus for informed debate.
We have chosen to address this issue by tackling a number of relevant questions for an assessment of the current state of the art of alternative methods for skin sensitization testing and what future needs might be. These are as follows: 1. What is it that we are really trying to achieve -what will success look like? 2. Is the international scientific community marshaled in the right way to make real progress in this area? 3. What should be the future research imperatives? 4. Skin sensitization testing in vitro -can we do it already? 5. Is hazard identification alone good enough? 6. What needs to change?

What is it that we are really trying to achieve -what will success look like?
There is a genuine need to identify chemicals that have the potential to cause skin sensitization and allergic contact dermatitis (ACD) and to accurately assess the likely risks to human health. Many hundreds of chemicals cause skin sensitization, and ACD is a common occupational and environmental disease (Febriana et al., 2011). Historically, the identification of contact allergens relied on the use of guinea pig tests (Buehler, 1965; Magnusson and Kligman, 1969). More recently, the murine local lymph node assay (LLNA) has found favor as a method that, compared with guinea pig tests, offers important animal welfare benefits (Kimber et al., 2002). The LLNA not only provides a generally robust and reliable means of identifying chemicals that have the potential to cause skin sensitization but also permits a characterization of relative sensitizing potency, which is required for the development

Introduction: skin sensitization
Allergic contact dermatitis resulting from skin sensitization is an important occupational and environmental health problem. Many hundreds of chemicals are known to cause skin sensitization, and allergic contact dermatitis is the most common manifestation of immunotoxicity in humans. It is important, therefore, that skin sensitization hazards/risks of new chemicals and products be evaluated accurately. In fact, toxicologists have methods and models that provide a reliable basis for the identification of skin sensitizing chemicals, assessment of relative sensitizing potency, and the development of effective risk assessments. However, current practices rely heavily on animal models of skin sensitization, particularly the local lymph node assay (LLNA), the preferred method for safety assessment.
There are now compelling reasons to develop novel approaches to skin sensitization testing that do not require the use of experimental animals. There has been -and continues to be -a very substantial investment in achieving this goal. Several promising methods are now undergoing validation. However, no non-animal test has yet been formally validated. Against that background, the purpose of this chapter is to provide a partisan (some might say biased!) view of the current development of alternative methods for assessment of skin sensitizing activity, to reflect on some of the challenges that we face in delivering novel testing strategies, and to provide a view of what is needed to improve and accelerate progress towards this objective.
A quick glance at the title of this chapter will trigger a feeling of déjà vu in many readers. Indeed, our own reaction would have been "Does the scientific literature really need another review of skin sensitization and the development of alternative methods for hazard and risk assessment?" The answer must be no -surely there is nothing new to say. There are already sufficient -or more than sufficient -reviews and overviews available. Some of those are cited here, and collectively they provide a comprehensive

A Roadmap for the Development of Alternative (Non-Animal) Methods for Skin Sensitization Testing
course, failure to negotiate any one of the hurdles will lead to the absence of sensitizing activity.
It will be clear that the move from an animal model to a non-animal method is fraught with difficulties and challenges. Put simply, mice used within the LLNA represent integrated biological models that incorporate, in a fully coordinated and physiologically relevant way, all the events and processes that are needed for the acquisition of sensitization. So, if a chemical is positive in the assay, then the assumption is that it has successfully achieved all that a chemical must accomplish to drive sensitization.
In a similar vein, it is estimated currently that approximately 25% of contact allergens must be activated by air oxidation, or within the skin, to acquire the chemical reactivity necessary to associate with proteins. It has proven difficult to achieve effective incorporation within in vitro models of appropriate and adequate oxidative/metabolic capacity.
The move from a fully integrated biological (animal) model is, therefore, necessarily going to be challenging. Nevertheless, that is the aim: to develop a method that will allow characterization of skin sensitizing activity without the use of animals and possibly with a higher level of confidence.
It is worth reflecting briefly on what is meant by the above phrase: "characterization of skin sensitizing activity." The first step in any toxicological investigation is the identification of hazard, and in the context of this article, to discriminate between chemicals that do, and chemicals that do not, have the potential to cause skin sensitization. Although that is an important step (and the aspect for which formal validation is required), it is not sufficient for addressing the likelihood that exposure to the chemical under any given set of circumstances will result in an adverse effect. For an effective risk assessment, an appreciation of relative potency is necessary. In the context of skin sensitization, this equates to the amount of chemical encountered on a skin surface that is required for the induction of sensitization. This, of course, is true for all forms of toxicological evaluation, but it is particularly important with regard to skin, because it is known that contact allergens vary by up to five orders of magnitude in relative skin sensitizing potency.
A measure of relative potency is provided by the LLNA because it is known that the readout used (the proliferation of draining lymph node cells) not only has a causal relationship with the acquisition of sensitization but also correlates quantitatively with sensitizing activity (Kimber and Basketter, 1997). Achieving an understanding of relative potency with a full in vitro approach is going to be very challenging. This is a theme that we will return to in Chapter 3.6. Drawing together the elements discussed above, the answer to the first general question we posed is as follows: The aim is to develop a non-animal method(s) that will provide a means of identifying contact allergens with an accuracy at least equivalent to, or approaching that of, the preferred animal models. While that is the minimum requirement, ideally, any novel method should improve upon the performance and reliability of the LLNA (considered the best animal model) and simultaneously provide an accurate assessment of relative potency.
Success, therefore, will be the development of a non-animal approach that provides a basis for reliable hazard and risk assessments of at least comparable accuracy to those afforded by the
of accurate risk assessments (Api et al., 2008; Rovida, 2011). However, it has to be acknowledged that, as with all test methods, the LLNA is not without limitations.
There are strong scientific, ethical, and legislative reasons why it is important to ensure that opportunities to Reduce, Refine, and Replace the use of experimental animals in research and investigative studies are exploited quickly and effectively. Probably the most appropriate code of practice when considering the use of animals in research is to pose a number of questions: (a) is the issue being addressed legitimate and important? (b) is it possible to address the question effectively without the use of experimental animals? If the answer to the latter question is no, then the third question is: (c) what is the best and most appropriate experimental design to ensure that the principles of the 3Rs are adhered to and a robust answer achieved?
Translating this to the theme of this article, the assumption is that it is not currently possible to conduct an evaluation of the skin sensitizing activity of chemicals without the use of animals (either guinea pig tests or the LLNA) that is fully accepted by regulators.
Although this assumption will be tested further under question 4, it is certainly the case that no validated, non-animal methods for the identification of skin sensitizing chemicals are currently available. As a consequence, and in line with the scientific, ethical, and legislative imperatives (such as the 7th Amendment to the Cosmetics Directive in the EU) that are driving interest in developing non-animal methods, there has been a very substantial investment in alternative strategies for skin sensitization testing.
A detailed survey of the many approaches is unnecessary here, and more information is available from the review articles cited above. The important point, however, is that alternative methods, in most cases, are based upon an attempt to identify biological properties or structural motifs of chemicals that are believed to be required for the acquisition of skin sensitization. The palette of strategies that has been considered is based on the understanding that for a chemical to cause skin sensitization a number of things have to be achieved, or biological/biochemical hurdles must be cleared. These include, but are not limited to: (a) the chemical gaining access to the viable epidermis, (b) the stable association of chemical with protein to form a complete antigen, (c) the activation, mobilization, and migration of cutaneous dendritic cells (DC) for transport of antigen to regional lymph nodes, and (d) the activation within lymph nodes of responsive T lymphocytes. Another approach has been to apply a systems approach to model in silico the key chemical and biological pathways that drive the induction of sensitization (Maxwell and Mackay, 2008).
For the most part, the alternative methods being explored (in vitro and in silico) are based upon evaluation of the ability of a chemical to provoke one (or more) of these required events or processes. That strategy appears appropriate, but it is crucial to bear in mind that the ability of a chemical to clear any one of those biological or (bio)chemical hurdles does not necessarily mean that it should be classified as a skin sensitizer. Does the ability of a chemical to form covalent bonds with protein necessarily signify that it will cause sensitization? Similarly, is there any reason to believe that if a chemical is capable of activating DC in vitro, this property alone will be sufficient to translate into skin sensitizing activity in vivo? Of
between fundamental mechanistic research and research driving the application of the test by industry and regulatory authorities on the one hand, and research applying established knowledge focused on the design of new test methods. If so many investigators are applying their skills to the development of alternative predictive tests, where will the transformational research that will provide the basis for really innovative developments in the future come from?
As an illustration, one area of investigation that has attracted considerable interest in the context of new approaches to skin sensitization testing is dendritic cell (DC) biology. There are a variety of proposed test methods based on the use of cultured DC, or DC-like cell lines, the theory being that exposure of such cells to skin sensitizing chemicals, but not to non-sensitizers, will provoke functional or phenotypic changes that will serve as biomarkers of sensitizing activity and as readouts for cell-based assay systems (Arkusz et al., 2010; Ashikaga et al., 2010; Ouwehand et al., 2010; Python et al., 2007; Johansson et al., 2011). There is no doubt that some of these assays perform rather well and show real promise. Nevertheless, the approach is predicated on a (largely unproven) assumption that the impact of chemical allergens on DC or DC-like cells in culture (usually at cytotoxic concentrations) is reflective of the changes induced in epidermal Langerhans cells (LC) and dermal DC during the initiation of sensitization in intact skin. However, despite much enthusiasm for this approach, and despite considerable investment that has supported a wide range of proposed assay methods, very little is known about how chemicals cause changes in cultured DC, or what relevance, if any, this has to the acquisition of sensitization. In the rush to develop a test or a new method, intriguing and important questions about fundamental mechanistic biology are ignored -or at least put aside.

What should be the future research imperatives?
There needs to be a realignment of "pure" and "applied" research in skin sensitization, with increased emphasis on exploring some of the important uncertainties and intriguing unknowns. Among the many issues that remain to be clarified are the following:
-The balance achieved in the skin and regional lymph nodes between the immunostimulatory, promotional, and regulatory signals delivered by discrete populations of DC, and how that balance impacts the acquisition of sensitization.
-The role of regulatory T cells (Treg cells) in controlling and constraining the induction of skin sensitization, the elicitation of ACD, and the relationship of Treg cells with effector T lymphocytes in determining the net vigor and quality of immune responses to contact allergens.
-The influence of the ways in which haptens interact with target proteins on the development of skin sensitization.
This list is merely indicative, certainly not exhaustive. Other investigators doubtless will be drawn to other research questions. However, the common thread in the examples highlighted above is that they each have the potential to inform our understanding of the factors that govern sensitizing potency. The balance achieved
LLNA. If that is what success looks like, then a frequently asked question is whether it is reasonable to expect that a single in vitro approach or test method will provide the required level of certainty. Some speculate that it will be necessary to develop a suite of methods that collectively provide a basis for making judgments about sensitizing potential. This sounds sensible, but relying on a battery of assays may prove to be technically demanding and experimentally unwieldy.

Is the international scientific community marshaled in the right way to make real progress in this area?
The most important opportunities in applied toxicology, including the development of alternative test methods applicable in an industrial context and for regulatory purposes, derive from an investment in "pure" research and an improved understanding of relevant cellular and molecular mechanisms. We mentioned above that there has been a huge investment, particularly in Europe, directed at promoting the development of alternative methods for skin sensitization testing. There is no suggestion that the motives driving this investment are anything other than laudable, but it is relevant to question whether the focus of that investment is the most likely to deliver the necessary breakthroughs. There has been too much emphasis placed on supporting applied research, often focused narrowly on the design and development of new methods. In recent years, too many new approaches have been proposed, and few of them have proven useful for a workable, full-replacement strategy.
Why have we seen a change of emphasis in skin sensitization research from characterization of the mechanisms through which cutaneous immune and inflammatory responses are induced and orchestrated, to new test development? A number of factors have influenced this:
-Everyone wants to develop a test. Even a superficial survey of platform and poster presentations at toxicology conferences reveals an ever-increasing number of papers describing attempts to develop "alternative" test methods for the identification of skin sensitizing chemicals. Naturally, many of these communications have merit, but not all. It is evident that some investigators do not have a clear understanding of what kind of "alternative" is really needed, nor of what is required by those charged with making decisions about the safety of new chemicals or products.
-Research follows the money, and it is clear that many investigators have found it necessary to develop more research themes focusing on applying current knowledge for test development to attract funding.
-In certain commercial sectors, and particularly the cosmetics industry (because of the deadlines imposed by European regulation), there is a very clear and pressing need to develop non-animal methods so that innovation can be supported when it is no longer possible to use animal tests.
A case is not being made that all currently supported research in skin sensitization is without value. Indeed, there have been important achievements. However, there is now an imbalance
zation, (f) the balance achieved between activated epidermal LC and activated dermal DC, and (g) the amount of antigen delivered to draining lymph nodes.
Again, the options listed above are not exhaustive, and there may be a number of other events that influence overall potency. Nevertheless, it is the case that we really have little understanding of how events induced in the skin and regional lymph nodes in the minutes and hours following topical encounter with a contact allergen shape the T lymphocyte response that will be induced. Tackling this question will not be easy, but a research investment in this area might elucidate the pivotal events and processes that determine the effectiveness of immune responses to contact allergens -and may also provide an appreciation of how the vigor of immune responses in general is controlled. Certainly, an understanding of the key events that impact on skin sensitizing potency would be of enormous value in considering novel approaches to testing that would deliver not only hazard identification but also an assessment of relative potency. In conclusion, therefore, the proposal is that there should be a greater investment in tackling some of the important questions remaining about the way in which the acquisition of skin sensitization is induced and orchestrated. In addition, a high priority should be given to investigation of the chemical, biochemical, and immunological events that determine the relative potency of skin sensitizing chemicals.

Skin sensitization testing in vitro -can we do it already?
There are currently no formally validated methods for the (hazard) identification of contact allergens using non-animal methods. But an important and interesting question is this: Leaving aside considerations of validation and regulatory acceptance, are we already in a position to make sound judgments regarding the skin sensitizing potential of chemicals? If we were to effectively marshal our collective know-how about skin sensitization, together with access to data generated by selected in vitro approaches, would it be possible to achieve something approaching a 90% overall accuracy of prediction of skin sensitizing activity -a performance in line with the LLNA or better? Such a hypothetical scenario would have two parts.
The first of these would be to bring together a small group of seasoned investigators with experience of skin sensitization and making judgments about the sensitizing potential of chemicals. This expert panel would include those with expertise in QSAR and aligning sensitizing potential with structural motifs and physicochemical properties. That expertise could be supplemented by access to one or more expert systems that seek to predict skinsensitizing activity as a function of chemical structure.
The second element would be the availability of data generated by selected in vitro tests, albeit test methods that have not yet been validated (although those mentioned are in the latter stages of formal evaluation). There are several assay systems from which to choose, including the following: peptide binding assays (Gerberick et al., 2004, 2009; Troutman et al., 2011); the KeratinoSens assay, based upon the Nrf2/Keap1 electrophile-sensing pathway; and the CeeTox Assay (McKim et al.,
between stimulatory and regulatory signals from DC, the balance achieved between effector T lymphocytes and Treg cells, and the impact of the kinetics and selectivity of the interaction of chemical allergens with skin proteins are all strong candidates with regard to influencing sensitizing potency.
Other exciting research themes could also be identified. There is no shortage of relevant challenging and exciting areas of research in skin sensitization. Addressing issues such as those listed above will not necessarily lead directly and immediately to the identification of alternative test methods, but there can be no doubt that the increased understanding of relevant immunobiological mechanisms that would result from such research would drive innovation and open up the development of new strategies.
Sensitizing potency, however, is currently the key challenge. It is now understood that contact allergens vary by up to five orders of magnitude with respect to their relative skin sensitizing potency. In practical terms, this means that with potent chemical allergens only very low levels of exposure are required for the development of sensitization, whereas with weak allergens repeated high-level exposure may be required for sensitization to develop. The phenomenon is clear, but we really do not understand why this is -and what specific factors govern potency. The question is of great academic interest but is also of considerable importance in the development of new test methods. If an in vitro assay is going to provide useful information about relative potency, then there will need to be readouts that correlate quantitatively with sensitization and are reflective of dose-response relationships.
At a relatively simplistic level it is clear that the extent to which sensitization is acquired is associated with the vigor of T lymphocyte responses in regional lymph nodes draining the site of exposure to the inducing chemical allergen (Kimber and Dearman, 1991;Kimber et al., 1999a). This is not unexpected, because skin sensitization is mediated by T lymphocytes, and the greater the level of proliferation in draining lymph nodes the larger will be the pool of antigen-responsive T cells. However, as alluded to above, it has to be acknowledged that, in addition to the extent of clonal expansion, the effectiveness of skin sensitization will likely be impacted by the quality of the T lymphocyte response. One qualitative aspect of that response is the balance between effector T lymphocytes that will drive the elicitation of ACD, and Treg cells that will down-regulate and constrain sensitization. In addition, the overall effectiveness of sensitization may be influenced by the "breadth," or clonal diversity, of the T lymphocyte response.
One can conclude, therefore, that the main influence on the effectiveness of skin sensitization will be the quantity and quality of the T lymphocyte response generated. However, this does not provide any indication of the events induced following encounter with a contact allergen that shape the response. There are several factors that individually, or in concert, may impact the vigor and quality of the T lymphocyte response. These include: (a) the speed with which the chemical reaches the viable epidermis, (b) the nature of "danger signals" elaborated, (c) the kinetics of association with target proteins, (d) the promiscuity of chemical interaction with proteins, either in terms of number of proteins with which adducts are made, and/or at the level of amino acid selectivity, (e) the kinetics of LC and DC activation and mobili-
The above is a pragmatic approach and has not been tested adequately in practice; there has been only a single partial attempt -by Natsch et al. (2009). Nor does it claim to distinguish between threshold events and those that bear a direct quantitative relationship with sensitizing potency. Nevertheless, it does at least provide a framework for how some assessment of potency might be informed by the use of readouts from in vitro tests combined with appropriate SAR analyses.
It is relevant here also to highlight that a number of other strategies have been proposed for informing skin sensitizing potency without recourse to animals. These include the development and deployment of appropriate mathematical models (Maxwell et al., 2011;Maxwell and Mackay, 2008), the development of an Integrated Testing Strategy (ITS) for skin sensitization in the form of a Bayesian Network (Jaworska et al., 2011;Maxwell et al., 2011;Maxwell and Mackay, 2008) and a tiered approach combining a keratinocyte-based test for identifying skin sensitizers and an epidermal equivalent-based potency test (dos Santos et al., 2011;Galbiati et al., 2011).
Notwithstanding the crafting of theoretical frameworks for considerations of potency that may or may not work in practice, the answer to the question posed in this section is that a non-animal solution to hazard identification alone (although being a significant achievement) is insufficient for a full safety evaluation and risk assessment. This view is clearly reflected in the recent review by a European expert group (Adler et al., 2011).

What needs to change?
Against the background of the issues described above, we have listed what we believe to be the most important changes that are required to promote the development of effective non-animal methods for the assessment of skin sensitizing activity. In no particular order the key issues are:
-The need for a realignment of skin sensitization research so that there is greater emphasis on exploring basic mechanistic aspects in the expectation that this will yield information and understanding that, in time, will provide a platform for real innovation and, hence, new ground-breaking solutions.
-Such a research investment would provide a much clearer understanding of the factor(s) that serve to determine the potency of skin sensitizing chemicals.
-The need to bring a greater realism to some within the scientific community who are seeking to develop novel test methods. It needs to be understood and appreciated that for a new method to be valuable it has to be technically robust, perform reliably, and offer the required level of predictive performance.
-In tandem with the above, there needs to be a greater willingness among some test developers to take a more dispassionate approach to the evaluation of putative tests. There is a need for a critical evaluation of the strengths and limitations of novel methods compared with existing in vivo models, even with their limitations.
-For the evaluation and validation of new methods (of whatever type) there is a need to evaluate specificity, sensitivity,
2010), as well as a variety of cellular assays based on the use of cultured DC or DC-like cells (Aeby et al., 2010; Ashikaga et al., 2010; Reuter et al., 2011; Sakaguchi et al., 2006; Johansson et al., 2011). It would be interesting to evaluate, prospectively and with an unbiased set of chemicals, how such an expert panel (with access to QSAR models and data derived from selected in vitro tests) would fare compared with the LLNA or with human ACD.
The foregoing is not a cri de coeur for the use of unvalidated tests in the safety assessment process (which would be inappropriate). Rather, it should be viewed as a reflection of how near we perhaps are to being able to identify skin sensitizing chemicals without recourse to animal experiments -if there is a willingness to align the experience and expertise which is already available with the outputs of selected test methods.

Is hazard identification alone good enough?
If resources are marshaled carefully, the identification of skin sensitizing hazards, without the need for animal tests, should be a realistic goal. Hazard identification might be sufficient to satisfy regulatory requirements (and, of course, where there is no skin sensitization hazard, further work will be unnecessary). However, just as safety evaluation cannot be completed solely on the basis of exposure data, the absence of information about relative potency for identified skin sensitization hazards will not support the development of accurate risk assessments or enable meaningful risk management. This is important because over several decades of the implementation of regulatory identification of skin sensitization hazard, there is no evidence of any impact on the clinical burden of allergic contact dermatitis. One could go so far as to argue that efforts to develop effective risk assessments based largely or solely on exposure data are doomed to failure, as it is rarely possible to link specific exposures with the development of allergic contact dermatitis.
One answer, therefore, is to make a greater research investment in the expectation that a more complete understanding of the immunology and biochemistry of skin sensitization will disclose the pivotal events in determining potency. Another approach is to consider how information deriving from currently available non-animal models for sensitization testing might be used to rank chemical allergens according to potency. Previous exercises have explored how, in theory at least, it might be possible to derive an estimate of overall potency by integrating information from two or more of several in vitro approaches. One strategy explored was to assign chemicals scores on the basis of whether there was a structural alert, and on relative activity in in vitro tests configured to measure the ability of chemicals to: (a) gain access to the viable epidermis, (b) form stable associations with peptides or proteins, (c) stimulate the activation/maturation of DC or DC-like cells, or (d) provoke proliferative responses by cultured T lymphocytes. In some cases the scores were binary (that is 1 or 2; for epidermal bioavailability and structural alerts). For other readouts a scale of 0 to 5 was used. Based on this paradigm, the relative sensitizing potential of a chemical would be calculated as the product of individual scores (Basketter and Jowsey et al., 2006).
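The scoring paradigm described above can be illustrated with a minimal sketch. It assumes only what the text states: binary readouts scored 1 or 2, graded readouts scored 0 to 5, and an overall index formed as the product of the individual scores. The function name, the readout labels, and the example scores are hypothetical and are not taken from the cited exercise.

```python
from math import prod


def relative_potency_index(scores):
    """Multiply the individual readout scores (hypothetical helper)."""
    return prod(scores.values())


# Hypothetical scores for one illustrative chemical; not data from the source.
example_scores = {
    "structural_alert": 2,           # binary readout (1 or 2)
    "epidermal_bioavailability": 2,  # binary readout (1 or 2)
    "protein_binding": 4,            # graded readout (0 to 5)
    "dc_activation": 3,              # graded readout (0 to 5)
    "t_cell_proliferation": 2,       # graded readout (0 to 5)
}

index = relative_potency_index(example_scores)
print(index)  # 2 * 2 * 4 * 3 * 2 = 96
```

Because the index is a product, a score of 0 on any graded readout drives the whole index to 0, whereas the binary readouts (1 or 2) can only leave the index unchanged or double it.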
A second area where there has been considerable progress has been in defining the phenotype, function, and impact on the induction of skin sensitization of discrete functional subpopulations of cutaneous DC. Of particular interest is the interplay between epidermal LC and dermal DC (Bobr et al., 2010;Clausen and Kel, 2010;Kaplan, 2010;Kaplan et al., 2008;Kimber et al., 2009;Noordegraaf et al., 2010). Our increased understanding of the roles played by skin DC not only in initiating but also in orchestrating cutaneous immune responses to contact allergens may pave the way to more sophisticated and more informative DC-based assay systems.
An investment in high quality research addressing important questions will always pay important dividends -and that holds true for skin sensitization and our need to drive new innovation in safety assessment.

1. Recent investments in the development of alternative in vitro methods for skin sensitization hazard identification have resulted in the design of a substantial number of potential assays. Those that show promise should be evaluated as soon as possible.
2. Progress in the development and refinement of in silico approaches to skin sensitization testing (mathematical modeling and computational chemistry) should be accelerated.
3. The main priority now is to develop non-animal methods for assessment of skin sensitizing potency of contact allergens. In this context it is important to identify biomarkers or chemical signatures that are quantitatively associated with the acquisition of skin sensitization.
4. The ability of existing in vitro tests, QSAR methods, and other testing strategies to inform skin sensitizing potency, in addition to identifying skin sensitizing hazards, should be investigated.
5. New strategies for potency assessment based on approaches such as: (a) an appreciation of the balance between promotional and regulatory signals by skin DC, (b) an understanding of the impact of the vigor, quality, and breadth of T cell responses on the development of sensitization, (c) the design of appropriate mathematical models, and (d) integrated testing systems should be explored.
6. An investment in developing a more detailed understanding of the cellular and molecular events that initiate, orchestrate, and control immune responses to skin sensitizing chemicals should be encouraged.
7. An investment in activities facilitating the application of the emerging tests by industry and regulatory authorities and assessing the limitations and strengths of the tests before full validation should be considered.
and overall accuracy with a gold standard dataset. In this case, that translates into a dataset that is populated by chemicals where there is sound evidence for the presence or absence of significant skin sensitizing potential in humans.
- Finally, there is a critical need for a general acknowledgement that the complete replacement of animal methods (such that safety assessments remain at least as effective as they are currently) requires that alternative approaches inform both hazard identification and assessment of potency. At present, formal validation activity addresses only the first of these.

Conclusions and recommendations: skin sensitization
The purpose of this chapter was to provide a critical and partisan appraisal of the current landscape with regard to skin sensitization testing. There is no doubt that there have been considerable achievements. Peptide binding assays continue to evolve and appear very promising. Cellular assays based on induced responses by DC, DC-like cells, and other cell types have considerable momentum currently, and there are three such assays currently undergoing formal validation in Europe. Efforts continue with the development and evaluation of SAR paradigms. New opportunities based upon an appreciation of the activation of the Keap1/Nrf2 pathway are being explored. The KeratinoSens assay recently has completed an inter-laboratory evaluation and has been submitted for formal validation (Andreas Natsch, personal communication). In addition, an inter-laboratory evaluation of a tiered approach combining the IL-18 assay (Galbiatti et al., 2011) and the epidermal equivalent potency test (dos Santos et al., 2011) is currently ongoing. So progress continues, and in all likelihood it soon will be possible to configure testing strategies based on accumulated expertise and experience combined with data from those in vitro and in silico approaches that are found to perform well, to identify skin sensitizing hazards without the use of animals. The aim will be to ensure that such predictive approaches are at least as accurate as, and probably better than, the LLNA, and that also should be achievable. However, as highlighted elsewhere, there is more to achieve and more that needs to be achieved. An increased investment will be needed in research focused on providing a more detailed understanding of the cellular and molecular mechanisms through which skin sensitization is induced and orchestrated.
The dividends of that research investment will provide the momentum for truly innovative solutions to the unaddressed challenges and that will inform our understanding of the biological/ biochemical bases that determine relative potency.
Two examples serve to illustrate the point. Work by Stefan Martin (University of Freiburg) and others has provided new insights into the role of the innate immune system in skin sensitization, and interactions between inflammatory reactions and adaptive and innate immune responses. This research will help define the danger signals and cofactors that are required for the effective acquisition of sensitization (Lass et al., 2010; Martin et al., 2011; Martin and Jakob, 2008; Schmidt et al., 2010; Weber et al., 2010).
mammalian species (non-human primate or dog). The mechanistic and conceptual basis for RDT may be broad, and it is not well understood for many compounds. In some cases, it may be due to a build-up of toxic substance(s) in one or more sensitive areas of the body. In other cases, the changed compound concentration is not a driving factor. In such cases, defense mechanisms may be exhausted, the tissue may be altered by regulations and counter-regulations, or immunological reactions involving the specific or non-specific immune system may be triggered. Besides assessing obvious signs of toxicity and organ specific toxicity, a number of other endpoints are evaluated, including body weight, hematological parameters, urinary constituents, and histopathology of each organ system. RDT testing is thought to be extremely important in toxicity testing, as it is considered to model repeated exposures to lower doses of a compound, which is more likely to occur in a real-world situation than short term exposure to high doses. Moreover, this approach also offers the opportunity to assess recovery in between dosing. Toxicities not seen in acute testing or in reproductive toxicity testing may be revealed by RDT tests.
Regulatory risk assessments for chemicals, pharmaceuticals, and cosmetics, including REACH, TSCA and the FDA and EMA guidances, respectively, require RDT testing as an integral part of assessing the potential risks of a substance. The EU Cosmetics Directive (Cosmetic Directive 76/768/EEC), by adopting its 7th Amendment (2003/15/EC), has already instituted an animal testing ban, and as of January 1, 2013, a marketing ban will go into effect for any new substance tested on animals. Thus, the need for alternative methods is clear. Besides these regulatory reasons, animal testing is considered ethically questionable by many, and it is expensive. Most importantly, the present animal-based regulatory tests do not provide specific information on human hazard, and they fail to provide a mechanistic rationale that would explain toxicity and allow science-based predictions. This problem is currently circumvented by the introduction of safety factors, and the need to move to more specific and useful results for toxicity testing is clear.
The classical 1:1 replacement approaches have found their limit when faced with the problems of assessing RDT. The adverse effects can be based on a complex web of disturbances in multiple target tissues, and the interplay between various pathways and systems requires new modeling approaches, integrating multiple models, pharmacokinetic parameters, and a large battery of mechanistic tests designed to elucidate PoT.
In 2010, with the 2013 deadline looming, experts in the various areas of toxicology were invited by the European Commission and stakeholders (such as industry, non-governmental organizations, EU States, and the Commission's Scientific Committee on Consumer Safety (SCCS)) to analyze the status
This chapter provides an overview of possibilities for replacing animals in repeated dose toxicity (RDT) testing and recommendations to improve and speed up the process of that replacement. The importance of RDT testing in the safety evaluation of chemicals, agrochemicals, pharmaceuticals, and cosmetics cannot be overestimated. RDT evaluation examines the potential for chronic toxicity and for organ-specific toxicities not seen in acute testing. The present testing schemes are based on rodent or non-rodent studies performed for 4 weeks (subacute toxicity), 13 weeks (subchronic toxicity), or 26-102 weeks (chronic toxicity). The tests, as they currently exist, are often followed up with further testing to more clearly define the nature of initial findings. Nevertheless, the false positive and false negative rate with respect to human adverse effects may be as high as 50% (Olson et al., 2000). The focus of toxicity testing in general should be to capture all potential toxicants and to assess their hazard. Species differences imply that the use of animals to assess toxicity is probably not the most efficient or detailed way to the end of safe and effective chemicals, pharmaceuticals, and cosmetics. Moreover, statistical issues, extrapolation from high to low doses, and other difficulties reviewed many times contribute to the weaknesses of animal-based safety testing. This chapter provides information on current and potential future in vitro and in silico approaches for assessing the major endpoints used in repeated dose toxicity. It also contains suggestions for improving and implementing those tests, and it outlines a roadmap forward in the field of alternatives to RDT testing.
The objective of RDT testing is to assess the potential hazard of a chemical after long-term exposure. The goal of such studies is to define a No Observed Adverse Effect Level (NOAEL) for the compound in question. Currently, the testing is usually performed in a rodent (usually rat) and potentially a non-rodent
is something that could begin today and quickly show measurable results. The suggestion to create a consortium with a vision of generating such a database may sound naïve in the present competitive environment. However, this type of exercise could lead to a clear and definitive advance in the prediction of toxicities without further use of animals. Setting up such a consortium would require strong regulatory pressures and incentives, particularly with regard to breaking down of barriers to data sharing. It would be well worth the effort, however, to come closer to the larger goal of animal-free toxicity testing.
There are several examples of this type of initiative already. The U.S. Environmental Protection Agency (U.S. EPA) has recognized the relevance of collecting high-quality regulatory in vivo data and making it accessible for cross-chemical computational toxicology analysis to create an in silico assessment for chemical compounds. The agency has created the U.S. EPA ToxRef Database, which profiles chemicals based on chronic toxicity. This database has already proven to be very valuable (Liebsch et al., 2011; Martin and Jakob, 2008). Another example of relevant collaboration is the Innovative Medicines Initiative (IMI) project eTox. This consortium aims to create a drug safety database from industry legacy toxicology reports and public data that will allow in silico prediction of toxicities.
Linking these types of efforts across different business and product areas could confirm common pathways of toxicity or help identify new ones. The OpenTox initiative could be a starting point; however, OpenTox currently only works on open collaborations, communications, and advisory boards. To function optimally, and have more impact in creating predictive assays and tools, it would need to add more rigorous experimental data sharing.

Integrated testing strategies
It is clear that, at present, we cannot screen for repeated dose toxicity using only alternative methods. However, the implementation of decision trees and tiered approaches, i.e., integrated testing strategies (ITS), will contribute greatly to reducing the use of animals (Grindon et al., 2008; Vanhaecke et al., 2011). This is something that can begin now and be modified as more and more alternative tests become available. Approaches such as the decision tree proposed by Vanhaecke et al. (2011) for predicting liver toxicity represent a very good start. The authors propose integrating computational and cell-based toxicity information, along with pre-existing data, to arrive at a theoretical NOAEL and assess an acceptable margin of safety. Similar strategies could be developed for all relevant organ systems, and be implemented in a variety of sectors, to achieve meaningful validation. However, such ITS will need to be broadly discussed among different experts in the toxicities assessed and in the technologies used. Before implementing the ITS, a very careful evaluation of the assays is needed. For example, Vanhaecke et al. (2011) suggests us-
of alternative methods and to estimate the time necessary to achieve a full replacement of animals in the cosmetic industry. An extensive and detailed document (Adler et al., 2011) summarized their conclusions. In that document, virtually all of the available in vitro and in silico methods were considered, and the unanimous conclusion was that seven to nine years will be required before there will be replacement of animal testing for skin sensitization and five to seven years to finalize methods to predict toxicokinetics. A timeline for full animal replacement could not be set for repeated dose toxicity, carcinogenicity, and reproductive toxicity. An additional group of experts in the field of alternative methods has reviewed the outcome of this report, and they came to the same conclusion.
For this reason, this chapter will not list the existing tests and methods again but rather will discuss current gaps of knowledge and ways forward to arrive at an alternative testing strategy within a reasonable time period.
Although the above documents focused specifically on testing for cosmetics, many of the underlying themes apply to all areas of toxicity testing, including pharmaceuticals, agrochemicals, and industrial chemicals. The tests requested by the various regulatory bodies for RDT evaluation are in most aspects equivalent. Also, independent of the commercial product area, the mechanisms that are the basis of toxic effects are expected to be the same.
This chapter aims to place the available methods (extensively covered in previous papers) in a chronology, providing a roadmap for replacement of RDT testing based on what can and should be done now and what will require more time and effort. Thus, the methodologies reviewed below are in order of current feasibility, with particular note of steps that must be taken to move each methodology closer to fruition. The final section summarizes the findings of this group of authors and provides recommendations for moving forward.

Create new alliances
It is in the interest of every party, including chemical, pharmaceutical, food, and cosmetics companies, to decrease the number of animals used, not only for ethical reasons but also from a budgetary point of view. This is particularly the case when regulators request a more expensive second species study in repeated dose toxicity testing. To combine efforts in the challenging task of reducing the use of animals to predict repeated dose toxicity (and, in general, any type of toxicity) would dramatically increase the possibility of success. Each company (regardless of the area of business), and each of the different regulatory agencies, has a database of information on toxicity, as well as investigative data that could advance the development and validation of in vitro/in silico systems immensely, inform pathways of toxicity based on in vivo data, and speed the process of reducing and replacing animals in repeated dose toxicity testing. This collaboration and data gathering initiative
ner, will improve our understanding of the cellular pathways involved in toxic processes and thereby improve our ability to predict toxicity. Understanding alterations in signaling pathways, and the role of these alterations in the activation of toxic effects, is crucial not only to elucidate how and why a toxic effect is occurring but also to extrapolate the effect to and from an in vitro system. The task of identifying relevant pathways, understanding them, and demonstrating correlation with toxicity is a challenging one, particularly if we consider that often it is the interaction between pathways and their localization that results in toxicity or protection of a cell (Latta et al., 2000; Volbracht et al., 1999).
Cells receive, process, and respond to information and signals through many distinct molecular pathways, which permit them to function properly. Often, components of several different pathways interact, resulting in signaling networks to create these cellular responses. A toxicological response often is induced by a disruption of these signals. For example, protracting a signal for a long period of time could lead to a toxicological response rather than a physiological one. Similarly, the strength of the signal could be enhanced by a toxicant and, therefore, unbalance the normal cellular responses or disrupt feedback loops. In considering the analysis of signaling pathways, it should be taken into account that often it is not a new pathway that is activated but rather a normal signal that is disrupted by the toxicant (see Fig. 4.1). Therefore, it is necessary to have an idea of the threshold that must be crossed to exert a toxic effect. Together with the identification of the pathways themselves, a quantification of the major players in the pathways should be performed. The identification of common pathways of toxicity, and the quantification of the signals therein, should continue to be the major focus of toxicological research, as not only will it move the field forward in understanding, it will assist in the development and validation of the in vitro and in silico systems we need to replace animals in repeated dose toxicity testing (Stokes and Wind, 2010).
ing stem-cell-derived hepatocytes, which currently are not the best model to screen for liver toxicity (Guguen-Guillouzo et al., 2010). The use of reference compounds appropriate to each organ system to test the ITS would be particularly useful in such a validation.
In some cases, such as in reducing animal use for REACH in the chemical industries, decision trees and tiered approaches may be well characterized and ready to be implemented. However, in other areas, such as in the pharmaceutical industry, these approaches are still in a validation phase and have yet to prove their ability to predict and manage risk at an early stage. Technologies and assays used in any ITS would be regularly reviewed and revised, of course, to ensure their continuous development and improvement.
Of note, and rarely mentioned, is the fact that ITS can and should incorporate human data, including epidemiological, genetic, and medical/clinical data, whenever applicable. As for data from in vitro and in silico systems, standards must be set to ensure the use of quality and comparable data in each system, bearing in mind that the overarching goal is to predict human toxicity. The use of this data cannot and should not be ignored in developing testing approaches.

Signaling pathway identification and analysis
In 2007, the United States National Academy of Sciences (NAS) published a report, Toxicity Testing in the 21st Century: A Vision and a Strategy, which envisioned a new approach to toxicology (NRC, 2007). The report called for the application of new and advanced technologies and biological knowledge to move toxicology forward. One focus of the report was the identification of common pathways of toxicity. New frontiers of science, such as systems biology, bioinformatics, high-throughput screening, high-content screening, transcriptomics, proteomics, and metabolomics, applied in a synergistic man-
pathway, and a more detailed analysis of pathways inhibition should be performed. Knocking out genes is another approach. Different strategies have been followed to do this systematically in murine and human cells. One modern approach is the generation of haploid stem cells, and the knockout of every gene of the genome in such cells, which then can be differentiated to any given cell type. More traditional approaches use cells from humans with different mutations, and a sophisticated variant uses such cells together with a derived cell type in which the mutation has been repaired. A completely different and complementary strategy useful for PoT mapping is the visualization of pathway activities by cellular reporter constructs.

Case studies
To bring together new technologies and existing toxicological knowledge, extensive case studies will take a central role, and their importance cannot be overestimated. One important basis will be an assembly of compounds causing RDT. Among these, the ones not identified in acute studies need to be identified. Many such examples are known from the pesticides field (Spielmann and Gerbracht, 2001;van Ravenzwaay, 2010). A first selection would involve those that result in human toxicity, or for which the animal toxicity is seen in the range of human exposure. The case study, then, would examine whether toxicity would have been predicted correctly by alternative methods. The cases where this approach failed are especially interesting. This should give an incentive for the establishment of methods that fill the gap. It would be worthwhile to promote such studies in EU-funded research consortia. The road to the future in this area leads through learning from the past. Such efforts need support as goal-oriented applied research. They are not funded by classical scientific funding bodies, which appear to consider them too applied and not sufficiently innovative. Unfortunately, it is not broadly accepted that such work forms the basis for large innovations in the field of risk assessment.

A brief overview of in vitro models
The need for in vitro systems, which can address all areas covered in RDT testing, is obvious. This section provides a brief overview of the available technologies and highlights some barriers and considerations. The cellular tools currently available are primary cultures or established cell lines from animals or humans (Skelin et al., 2010).
Primary cultures obtained from animals have three major limitations, the most obvious one from the point of view of reducing animal use being that animals are used. The number of animals used to perform experiments with primary cultures is fewer than for in vivo testing, but often a significant number of animals must be sacrificed to obtain a primary culture, particularly for difficult-to-culture cell-types. A second limitation of primary cultures is the short life span of the cultures. With the exception of some neuronal systems (Viviani, 2006) that have
There are a number of mechanisms by which these pathways can be investigated, many of them pioneered in medical research in order to identify pathway alterations that result in human pathologies. These include technologies from knockout yeast and mice, to antibodies and perhaps RNA interference (RNAi) (Moffat and Sabatini, 2006). The concept of each of these technologies is that a specific member of the pathway in question can be perturbed, which can lead to identification of important pathway members, how they interact, and to what degree.

RNA Interference (RNAi) and other interventions to define pathways of toxicity (PoT)
RNAi technology is based on the concept that by introducing a sequence-specific iRNA that will lead to a post-transcriptional gene-silencing process, one can identify important pathway interactions and mechanisms. This tool is important because knocking down (KD) the gene(s) inhibits the entire pathway, allowing identification of the proteins that play a role in signaling through post-translational modifications and monitoring of their role in signaling (Virshup and Shenolikar, 2009). Different specific RNAi sequences can result in different percentages of KD in the cells, giving hints as to the threshold of inhibition or activation needed to create a toxic effect in the signaling pathway. In other words, this technology is based on a "quantitative" indication of the signaling pathway.
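One way to picture this "quantitative" use of knockdown data is to bracket the threshold between the strongest knockdown that produced no toxic effect and the weakest knockdown that did. The Python sketch below is only an illustration under stated assumptions: it assumes that the toxic effect appears monotonically with increasing knockdown, the function name is hypothetical, and the measurements are invented rather than taken from any cited study.

```python
from typing import List, Optional, Tuple


def bracket_kd_threshold(
    observations: List[Tuple[float, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """Bracket the knockdown level at which a toxic effect first appears.

    Each observation is (percent knockdown, toxic effect observed).
    Returns (highest KD without effect, lowest KD with effect); either
    bound is None when no observation of that kind is available.
    """
    no_effect = [kd for kd, toxic in observations if not toxic]
    effect = [kd for kd, toxic in observations if toxic]
    lower = max(no_effect) if no_effect else None
    upper = min(effect) if effect else None
    # Assumption: a single threshold exists, so every non-toxic KD level
    # must lie below every toxic one.
    if lower is not None and upper is not None and lower >= upper:
        raise ValueError("observations are not consistent with a single threshold")
    return lower, upper


# Invented example: four RNAi sequences with different knockdown efficiencies.
data = [(20.0, False), (45.0, False), (70.0, True), (90.0, True)]
print(bracket_kd_threshold(data))
```

In practice, additional RNAi sequences with knockdown efficiencies falling between the two bounds would narrow the bracket.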
RNAi in combination with bioinformatics tools can provide even more knowledge on the signaling pathways involved in toxicity. There are several examples of software and algorithms created to correlate the KD pathway from an RNAi experiment with other pathways (Kaderali et al., 2009). Expanding the genetic analysis of the KD genes can achieve a similar result: A transcriptomic evaluation of gene changes within a pathway can provide an overview not only of the specificity of the inhibition, but also of possible correlations among pathways that could highlight new toxicological interactions, particularly if included in a bioinformatics analysis.
RNAi has proven useful in areas such as in identifying pathways involved in cancer and apoptosis induction (MacKeigan et al., 2005). However, applying this technology directly to the in vitro systems mentioned above might not be immediately feasible, since thus far RNAi methods are well established in cell lines and in dividing cells but are difficult to use in non-dividing primary cultures and ESC-derived cells. Luckily, the signaling pathways involved can be investigated in non-primary, non-ESC cell lines, and the discoveries made there can be applied to the in vitro technologies to improve predictability. RNA interference has been described here in more detail as an example for a pathway-mapping tool. Other approaches are available and are being developed. One involves the selective inhibition of pathways by chemicals. These act especially fast, and new pathways have been identified in this way (Falsig et al., 2004b; Lotharius et al., 2005; Lund et al., 2005). A big advantage is that they can also be combined for more sophisticated pathway mapping (Falsig et al., 2004b). The major disadvantage is that often they do not selectively inhibit only a specific
and Guilak, 2003). This could be an interesting area to explore further, particularly considering the accessibility of adipose tissue from surgical operations.
Embryonic stem cells (ESC) are isolated from the inner cell mass of 5-6-day-old blastocysts (Davila et al., 2008) and are fully pluripotent, meaning they are capable of giving rise to most tissues of the organism, including germ line cells. Under proper differentiation they should be capable of generating all the cell types present in an organism.
In 2006, Takahashi and Yamanaka published a breakthrough in stem cell biology: mouse somatic cells that could be reprogrammed back to pluripotent stem cells. The era of induced pluripotent stem cells (iPSC) had begun (Takahashi and Yamanaka, 2006). One year later, the same group showed that human somatic cells can also be reprogrammed into pluripotent stem cells by transduction of four defined transcription factors: Oct3/4, Sox2, Klf4, and c-Myc. The derived cells had the same morphologic, genetic, and epigenetic characteristics as stem cells (Takahashi et al., 2007). However, before considering the use of iPSC for clinical or toxicological purposes, the issue of the mechanism of reprogramming must be solved, since this process implies the use of viral transduction (which leads to a safety concern for clinical applications) and the activation of transcription factors and oncogenes present also in cancer stem cells (a concern for both clinical and toxicological use) (Jaenisch, 2009). That said, various papers already have been published reporting that reprogramming of somatic cells can be achieved without using viral delivery of reprogramming factors and evaluating the relevance of c-Myc and Klf4 in this process. A reliable methodology for doing so across many labs would increase the potential for these cells in both clinical and toxicological applications (Cox and Rizzino, 2010; Jaenisch, 2009). Thus, iPSCs may have great potential for predicting toxicity. They can be a source of potentially all tissues derived from various human populations with different pharmacogenomics profiles and a variety of genetic variabilities.
Stem cells represent a cellular system that has several advantages compared to stabilized cell lines and primary cultures, including normal genetic profile, normal growth, uniform cellular physiology, and pharmacology (McNeish, 2007). They have a number of unique features that make them attractive and potentially valuable for toxicological screening (Ameen et al., 2008; Davila et al., 2008; Jensen et al., 2009; McNeish, 2004): a) Stem cells divide and renew themselves for a long period of time, and therefore they can provide an almost unlimited supply of cells. Since, like all in vitro dividing cells, stem cells can accumulate mutations, karyotyping the cells is necessary after long periods in culture to confirm genetic normality prior to use in testing. b) Stem cells are pluripotent and therefore potentially able to differentiate into any human tissue. This opens the possibility of creating different cell types from the same organ in one culture. For example, an ideal in vitro liver toxicity system would have a "liver-like" organ that includes not only hepatocytes but also all other relevant cells, such as Kupffer cells,
a relatively long life in vitro, many cultures have a lifespan of 2 to maximum 14 days (Volz et al., 1991). Additionally, even if the survival of the culture can be increased using improved culturing methods, some of the relevant function and signaling pathways of the cells can be lost (Hartung, 2007a). For example, hepatocyte cultures are relevant, not only to predict one of the most common types of toxicity observed when testing chemicals and drugs, but also to properly predict metabolism and pharmacokinetics, key parameters necessary to properly forecast repeated dose toxicity. Several efforts have been made in improving this relevant cellular system and, at present, the cultures can survive for several weeks.
However, during this time the cultures lose their metabolizing capacity (Miranda et al., 2009) and therefore lose value in predicting for repeated dose toxicity. A third limitation, which may also be seen as an opportunity, is that these systems will be predictive (potentially) of animal toxicity, rather than human toxicity, and assessment factors for interspecies variation would still be necessary in the final risk assessment (Falsig et al., 2004a;Lund et al., 2006). Since the goal is to predict human toxicity, continuing to use animal systems may not be ideal. However, these animal systems may be seen as an interim step in the process of full animal replacement, in that it will be easier to validate/assess the value of these tests by comparing the in vitro results to already available in vivo data from the current animal tests, and these tests may be easier/quicker to develop than the respective human systems.
Human primary cultures are perhaps the most relevant system for in vitro screening from the standpoint of species specificity and maintenance of the optimal genetic profiles and signaling pathways. However, a major disadvantage of human primary cultures is the poor availability of human samples (often derived from cadavers or cancer patients), resulting in little control over the phenotypes selected for screening. An alternative to primary cultures is immortalized human cell lines. While these cells have the advantage of being easy to culture and the ability to increase screening throughput, they may have altered signaling pathways, and in some cases their metabolism is changed (more glycolytic energy generation). In most cases they lose xenobiotic-metabolizing capacity (Hartung, 2007a). Conditionally-immortalized cells, or cells in which the immortalizing transgene can be deactivated, may offer a compromise solution (Lotharius et al., 2005;Scholz et al., 2011).
Stem cells can be classified into three major categories, according to derivation: embryonic stem cells (ESC), adult stem cells (ASC), and induced pluripotent stem cells (iPSC). Adult stem cells, which include mesenchymal stem cells (MSC), are present in somatic tissues and have the characteristics of multipotent adult progenitor cells. They are not able to differentiate into all cell types of the organism. However, it should be taken into consideration that both bone marrow mesenchymal stem cells (BMSC) and adipose-derived mesenchymal stem cells (ADMSC), when properly differentiated, have potential for hepatic and neuronal differentiation (Banas et al., 2007;Gimble cell-systems that constitute them will be important in predicting complex toxicities.

Endpoints
The main toxicological endpoints for in vitro technologies have been the classical markers for cell death, such as membrane permeability, intracellular energy levels, glutathione levels, and other general endpoints that represent a high level of toxicity. However, it is important to remember that toxicity is first induced by the malfunctioning of cells, from which significant cell dysfunction and death follow. For example, for compounds that act on the cytoskeleton or on exocytosis, alteration of cytoskeletal components or enzyme release should be considered a significant endpoint, rather than cell death alone. In this way, the substitution of classical "toxicological" endpoints with functional ones is a way in which classical toxicity prediction could be improved, especially when predicting organ-specific toxicities. For example, compounds that are cardiotoxicants often are found to be cytotoxic in hepatocytes or other types of cell cultures. This information may be useful for acute toxicity but is less relevant for the assessment of organ-specific toxicity in the heart. In this case, it is more relevant to consider the contractile capacity of cells, rather than the induction of apoptosis.
Another good example of the concept of more specific toxicological endpoints improving the overall quality of in vitro testing in general is that better in vitro ADME (absorption, distribution, metabolism, and excretion) prediction significantly reduced the attrition percentage for compounds in development in the pharmaceutical industry (Kola and Landis, 2004). Specific studies of ADME-related mechanisms led to the development of good predictive systems with the most indicative endpoints. Recently, more investment has been made in developing new approaches to investigate more sophisticated and meaningful endpoints. High-content screenings, platforms for biomarker detection, TaqMan Low-Density Arrays, and new technologies for the assessment of phosphorylated proteins are examples of technologies that allow the investigation of a wider variety of toxicological pathway endpoints.

In vitro exposure
Kinetics and biodistribution are two key factors that must be included in the evaluation of repeated dose toxicity. However, in vitro screenings often do not consider the actual (as opposed to nominal) in vitro concentration, bioavailability, and degradation of compounds. Frequently, synthesized compounds are not stable at 37°C and/or bind to plastics or media proteins, factors that often are not considered or accounted for in in vitro tests. The importance of this cannot be overestimated, as it is crucial for data interpretation. Although it is labor intensive, the detection of the real free concentration and measurement of the stability and availability of the compounds in an in vitro system cannot be neglected.

Assay validation
Not all parties were in agreement with the timelines envisioned in the EC's expert panel report (Adler et al., 2011). Some claim stellate cells, and cholangiocytes. Multiple cell types in each organ system are undoubtedly important in various types of toxicity, so a wider variety of cells in the in vitro system could provide a better picture of potential toxicity. c) Stem cells can represent genetic diversity. This is particularly true if induced pluripotent stem cells (iPSC) are used (see note on types of stem cells above). d) Under the appropriate culture and assay conditions, the throughput and predictivity of in vitro assays using cell culture would be increased significantly through the use of stem cells. However, the limitations of the system should not be neglected. Stem cell biology is a young science, and so far the culture of stem cells is not trivial. Additionally, when stem cells differentiate into different cellular systems, the differentiation rarely occurs in 100% of the population, and not all of the cells are in the same stage of full differentiation (Ameen et al., 2008). For example, hepatocytes, when properly differentiated to produce hepatic endoderm cells or hepatocyte-like cells, present characteristics of fetal hepatocytes and do not express fully active cytochrome P450 signals (Greenhough et al., 2010). Similarly, stem cell-derived cardiomyocytes resemble human heart tissue but variably and with gene expression that is not the same as in adult heart tissues, indicating that additional differentiation protocols are needed (Asp et al., 2010).
So far, only a limited number of cell types have been differentiated, compared to the variety of potential cell types within an organism. Limited phenotypes and functional data are available for the embryonic stem-derived cells, with few exceptions. More research and investigation is needed to determine the state of maturation and functionality of the different cellular types. When these characteristics can be verified, the possibility of applying in vitro stem cell-derived models in predictive strategies in toxicology will increase dramatically. This is not possible in the next couple of years, but, based on the data available so far, it may be possible in three to seven years.

Specific considerations for in vitro methods
The following constitute special issues that must be borne in mind during the development, validation, and implementation of cellular test methods:

Culture methods
The application of techniques such as 3D culture systems and co-culture has great potential for toxicity testing. Several examples confirm the relevance of 3D culture models to improve the structure and the prediction rate, not only for toxicity but also in screening for pharmacological assays (Dash et al., 2009;Lan and Starly, 2011;Meng, 2010;Nakamura et al., 2011;Toh et al., 2009). Similarly, co-culture methods will give a better idea of the relevance of interaction and crosstalk between the different cell types. Co-culture systems have been proven particularly relevant in prediction of inflammatory effects and the physiological interaction between signaling pathways (Boraso and Viviani, 2011;Scharf et al., 1996;Tukov et al., 2006). The further development of these methodologies in concert with the

Conclusions and recommendations: repeated dose toxicity
It is most likely that a decade or more will be required before the gaps can be appropriately filled. That said, the authors recommend a step that can be taken immediately: the implementation of more stringent and appropriate ITS for testing. Another step that could move forward immediately, and that could improve ITS testing schemes, as well as in silico and in vitro technologies, is a frank and complete gathering and assessment of the repeated dose toxicity data that already exists for a wide variety of compounds, and the use of these data for case studies investigating the needs and pitfalls for new assays. This would require the collaboration of a variety of entities, including commercial, governmental, and non-governmental. The benefits that could be reaped by such a concerted effort in data gathering and sharing clearly outweigh the difficulties. In order to improve the predictivity of current in vitro and in silico tests, and even the current tests, the identification of pathways of toxicity must continue to move forward. This requires the use of new technologies in the field of omics and systems biology, combined with new cell models and evaluation strategies based on chemical inhibitors or gene inactivation. One particular issue that must be addressed is the setting of guidelines as to what constitutes an appropriate model for each organ system, i.e., what makes a heart a heart, a liver a liver, etc. In this context, it will be important to consider how immunological and inflammatory reactions can be incorporated in such organ systems.

Recommendations: repeated dose toxicity
The following steps are suggested to replace animal testing for repeated dose toxicity in an appropriate and timely fashion:
1. Joint task force: A joint effort toward a toxicity database to gather all current data on a wide variety of compounds would greatly improve the quality and speed of new test development and validation. Organization of this effort should begin immediately. The data should be used to support case studies designed to identify test requirements and pitfalls, as well as for test evaluations.

Tiered testing systems and decision trees (ITS):
Although it clearly is not yet possible to replace in vivo testing completely, we can refine and reduce the number of animals used today. Implementing decision trees, tiered approaches and/or applying screening strategies is possible immediately, and these can be modified as more and more non-animal tests become available. In addition to existing animal data, data from in vitro tests and data from in silico systems, as well as human data (epidemiological and clinical/medical), can be integrated into these types of approaches. These data should not be ignored! The time to act on this is now, for all types of compounds.
that the date of 2013 for animal replacement is still possible (Balls and Clothier, 2010;Spielmann, 2010;Taylor et al., 2011). Taylor and Casalegno, in particular, claim that several alternative methods are available where the percentage of prediction is above 80% (Carfi et al., 2007;Duff et al., 2002;Huang et al., 2009;Inoue et al., 2007;Langezaal et al., 2002;Pessina et al., 2001). Although these are all very promising examples, seldom more than ten compounds were tested in these assays. They must be more appropriately validated with a larger number of compounds, while still achieving a high percentage of prediction to be more universally accepted. To that end, it is worth mentioning that a good validation, particularly for the complex endpoints of repeated dose toxicity, should include a sufficient number of compounds, ideally representing a variety of classes. To increase the number of classes of compounds that can be used for validation of common tests, again, collaboration between different industries and entities is the ideal. Each validation must be tailored to the system being tested, and certain agreements must be set for all tests for a certain type of toxicity (Hartung, 2007b). For example, for organ-specific toxicities using in vitro cellular assay tests, it must be decided "What constitutes a heart?" and "What constitutes a liver?" More broadly, what cell types, gene expression, and physiological markers must be set in order for a system to appropriately represent the organ in question? Thus, comparison directly to current endpoints and markers may be necessary at first, but a true assay validation must be tailored to the test or testing scheme in question, particularly for repeated dose toxicity.

In silico prediction
The value of bioinformatics, in silico technologies, and systems biology in analyzing the data, identifying new pathways, and predicting toxicity is inarguable. Many of the aforementioned reports and reviews on the replacement of animal tests summarize the state of the science for in silico methodologies for repeated dose toxicity testing, so we will not provide a summary here. However, as we work toward the goal of in silico models and methodologies as a key part of toxicity testing, it is of extreme importance to recognize that the quality of the data used to create predictive in silico models significantly affects the quality of the system itself. If low quality data are used, the system is designed to fail. When designing in silico methods using in vivo data, it is vital to have data from well-designed experiments that indicate the time course of the toxicity and that will correlate pathology with molecular and mechanistic endpoints. If in silico methods are developed on the basis of in vitro data, the quality and predictivity of the experiments become even more important. For example, basing an in silico model for pathway analysis on data from tumor cell lines would be suspect, since these cell lines often have altered signaling pathways. Another example is the use of RNAi data: it is essential that the appropriate cell line was used to derive the data, and only the pathway in question was affected by the interference. eventually can be replaced with simple assays, as in the ToxCast program. b. More sophisticated methods will probably decrease the throughput, but, at present, they will most likely provide more long-term and stable systems. They may, for the foreseeable future, be better at predicting more complex organ toxicity (e.g., 3D-systems and co-cultures), particularly inflammatory and fibrotic processes. c. Appropriate endpoints must be chosen for each test and test system: what do we want to know and what toxicity are we trying to predict? 
Omics approaches will get rid of this problem, as many endpoints can be evaluated simultaneously (Henn et al., 2009). d. Real free concentration and stability of the compounds during the exposure in vitro is of major relevance for evaluating the actual toxic dose. Overall, the modeling and prediction of compound concentrations will play a key role for QIVIVE. e. For the complex models of biological processes, a significant number of known positive and negative compounds are required to evaluate the performance of the system. The selection of compounds should consider the applicability domain and different chemical classes, as well as modes of actions. The creation of a reference list of compounds for which information on mechanisms of toxicity and potency is readily available would speed the validation process immensely for all new testing systems.

Considerations for the development and validation of in silico models:
It is extremely important to be sure of the quality of data used to build in silico models. Specific criteria to evaluate the robustness and quality of the experimental data used in the development of in silico models should be developed and agreed upon in order to address this issue and to assist in design and validation of high quality in silico models.
3. Understand signaling pathways: Understanding the molecules and pathways involved in toxicological events is crucial for progress in toxicology. This is probably the most important activity for future success in replacing animals for repeated dose toxicity (RDT) testing. We should consider: a. The signaling pathways involved in toxicity may be normal signals that are altered in the duration or magnitude of response. Therefore, a quantification of the signal is of great relevance. For this reason, two different concepts are followed initially. The identification of PoT is a more long-term goal. An ITS based on high-throughput mapping of PoT and their disturbance in simple systems may eventually yield a good toxicity prediction. In the meantime, while not all quantitative relationships of the network of PoT are known, and while it is still unclear why chemicals affect one cell type more than another, more complex systems will be employed to arrive at more apical endpoints (Zimmer et al., 2011). The two approaches will be complementary and require a parallel development for some time. b. Tools such as RNAi or chemical interference, which are often implemented to aid in understanding signaling pathways in various diseases, could help toxicologists understand the signaling pathways involved in toxicity.

Considerations for development and validation of in vitro systems:
A large number of potentially useful in vitro cellular assays are available, and each of them has advantages and disadvantages. It needs to be considered that: a. All in vitro systems have limitations, and the choice of which to use will depend on the question asked. This is particularly important in the nearer future, with the use of complex test systems. Only these experiments, and comparison with high-throughput approaches, can show whether the complex systems to inform the general public about the risks that chemicals may pose have served us well. We believe it is likely that revamping our testing paradigms by basing them on updated and rigorously tested science and leaving the precautionary aspect explicitly to the risk management process would better serve both those involved in carcinogenicity testing and the public.

Introduction: carcinogenicity
In April 2010, the US President's Cancer Panel published the report "Reducing Environmental Cancer Risk" (Reuben, 2010). Although the report acknowledges that "overall cancer incidence and mortality have continued to decline in recent years" (see also Fig. 5.1), it states "the true burden of environmentally induced cancer has been grossly underestimated. With nearly 80,000 chemicals on the market … un- or understudied and largely unregulated, exposure to potential environmental carcinogens is widespread." This situation must be considered in the context that life expectancy has doubled (Kirkwood, 2008) during the period in which these chemicals were introduced.
At the same time, the possible health risks posed by chemicals are of considerable concern to the general public (Entine, 2011), which fuels the demand for safety testing of chemicals. Surveys conducted by Eurobarometers in 2005 and 2010 asked Europeans the question of how likely they consider the possibility that environmental chemicals damage their health. In both years, around 18% of respondents considered this "very likely" and 43% "fairly likely" (Eurobarometer 73.5 from 06/2010 and 64.1 from 09-10/2005). In strong contrast, the degree of contribution of chemical exposure to the overall cancer rate has been estimated at only 4% for occupational exposure, 2% for pollution, less than 1% for industrial products, and 1% for medicines and procedures (Doll and Peto, 1981). These estimates, however, are outdated and, for example, did not take into account the interactions of multiple factors.
It is not the purpose of this paper to take a position in any of these debates but rather to address the issue of how to best test chemicals for carcinogenic potential, given the potential of these chemicals to exert health effects. At the same time, we have to ask ourselves whether traditional precautionary methods used

Fig. 5.1: Cancer mortality in the US over time
Annual age-adjusted cancer death rates among males and females for selected cancers, US 1930-2006. Adapted from Jemal et al. (2010). Rates are adjusted to the 2000 US standard population. Due to changes in International Classification of Diseases (ICD) coding, numerator information has changed over time. Rates for cancers of the lung and bronchus, colon and rectum, and liver are affected by these changes.
It is important to note that carcinogenicity testing was developed as a result of historical cases of adverse effects, and the test models currently in place were developed with the existing knowledge at that time. However, the fact that there has been much scientific progress relevant to this field since then, combined with the degree of public concern about potential chemical carcinogenicity, has led us to focus this paper on carcinogenicity testing.

Standardization of protocols
The cancer bioassay is astonishingly young, given the importance of the health effect in question: the standardized protocol was suggested by the US National Cancer Institute in 1976 and adopted by OECD in 1981. The ICH (International Council on Harmonisation of Technical Requirements for the Registration of Pharmaceuticals for Human Use) only adopted the test for use in pharmaceuticals in 1997. contribute to cancer initiation and promotion. The potential of chemicals to interfere with repair and defense mechanisms, as well as detoxification and excretion, further contributes to this complexity.
An ideal carcinogenicity testing system would take all of these factors into account. Unfortunately, such a system does not exist. In this paper, we assess the available tools for carcinogenicity testing, introduce emerging tools that could transform this testing paradigm, and discuss the potential we see for these novel methodologies.
Definition of carcinogenicity: "Chemicals are defined as carcinogenic if they induce tumors, increase tumor incidence and/or malignancy, or shorten the time to tumor occurrence. Benign tumors that are considered to have the potential to progress to malignant tumors are generally considered along with malignant tumors. Chemicals can induce cancer by any route of exposure (e.g., when inhaled, ingested, applied to the skin, or injected), but carcinogenic potential and potency may depend on the conditions of exposure (e.g., route, level, pattern, and duration of exposure)."

Application of the framework to carcinogenicity testing
We have applied the assessment framework presented in Chapter 1 to analyze various options as potential alternatives to the cancer bioassay (OECD TG 451; OECD, 2009), which is conducted as a 2-year bioassay in rats and mice and is currently the only accepted test for carcinogenicity.
Testing with the cancer bioassay in two species can involve 600-800 animals, the histopathological examination of more than 40 tissues per animal, and costs approximately € 1 million per chemical and species (Vanparys et al., 2011). This bioassay is obviously time-consuming and expensive, and uses large numbers of animals. In addition, the assay's predictivity for humans has been challenged (Knight et al., 2006a,b,c;Shanks et al., 2009). Thus, while protection against potential carcinogenic effects of environmental chemicals is a key desire of the public, this assay is not suitable for broad use, nor is it broadly used.

Abolition of useless tests
The concept that genotoxicity is the first and foremost mechanism of chemical carcinogenicity is rarely challenged. However, there are little or no epidemiological data that support the hypothetical existence of widespread chemical carcinogenesis. Not only has average age increased continuously over the last 150 years (Kirkwood, 2008), during which period about 100,000 chemicals were introduced into our environment, but age-adjusted cancer rates did not increase over this time period (Jemal et al., 2009). Furthermore, exposure to mutagens did not correlate with oncomutations in people (Thilly, 2003).  Oliveira et al., 2007) appears to be a most critical factor. Even when the same strain is used, there appear to be problems with standardization that hamper the use of historical control groups (Haseman et al., 1997). In this study, the most commonly used strains showed strong weight gain and changes in some tumor incidences that resulted in reduced survival over just one decade, which was attributed to the intentional or inadvertent selection of breeding stocks with faster growth and easier reproduction. Other factors that have been suggested that could possibly influence the bioassay protocol over time include caging protocols, diet, environmental factors, genetic drift, study duration, and survival differences.
An analysis of 1,872 individual species/gender group tests in the US National Toxicology Program (NTP) showed that 243 of these tests resulted in "equivocal evidence" or were judged as "inadequate studies" (Seidle, 2006), suggesting the protocol as it stands is not robust. The two-species paradigm also has been challenged (Alden et al., 1996) by studies showing that rats are more sensitive, and regulatory action is rarely taken on the basis of bioassay results in mice (Van Oosterhout et al., 1997; Although it is in many respects a well-standardized protocol, it has been criticized as having poorly defined endpoints and a high level of uncontrolled variation. Suggestions for aspects of the test that could be optimized include proper randomization, blinding, better necropsy work, and adequate statistics (Freedman and Zeisel, 1988). However, nearly 30 years after its adoption by OECD, the most recent test guidelines (OECD, 2009) still do not make randomization and blinding mandatory, and the guideline statistics do not control for multiple testing, despite the fact that about 60 endpoints are assessed in the assay. Furthermore, the data analysis is ill-defined: "When applicable, numerical results should be evaluated by an appropriate and generally acceptable statistical method." Reducing the duration of the assay to 18 months has also been suggested (Davies et al., 2000), although others have disputed the applicability of this option (Haseman et al., 2001).
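The multiple-testing concern raised above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the roughly 60 endpoints are tested independently at a nominal significance level of 0.05. Neither assumption comes from the guideline, and real endpoints are correlated, so the result is indicative only. The function names are hypothetical.

```python
def familywise_error_rate(n_endpoints: int, alpha: float = 0.05) -> float:
    """Chance of at least one false-positive finding across n independent tests.

    Assumes every endpoint is tested at the same level `alpha` and that the
    tests are statistically independent (an illustrative simplification).
    """
    if n_endpoints < 1:
        raise ValueError("n_endpoints must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_endpoints


def bonferroni_alpha(n_endpoints: int, alpha: float = 0.05) -> float:
    """Per-endpoint significance level under a simple Bonferroni correction."""
    if n_endpoints < 1:
        raise ValueError("n_endpoints must be at least 1")
    return alpha / n_endpoints


if __name__ == "__main__":
    n = 60  # "about 60 endpoints" as stated in the text
    uncorrected = familywise_error_rate(n)
    corrected = familywise_error_rate(n, bonferroni_alpha(n))
    print(f"Uncorrected family-wise error rate: {uncorrected:.3f}")
    print(f"Bonferroni per-endpoint alpha:      {bonferroni_alpha(n):.5f}")
    print(f"Corrected family-wise error rate:   {corrected:.3f}")
```

Under these assumptions, an uncorrected analysis of 60 endpoints would produce at least one spurious "significant" finding in the large majority of studies of a completely inert substance, which is one plausible contributor to the high positive rates discussed in this chapter.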
In addition, the assay has not been standardized for animal strains, with the only definition being that "young healthy adult animals of commonly used laboratory strains should be employed." This is contrary to evidence that strain standardization  Oliveira et al., 2007) But doses that are hundreds to thousands of times higher than normal exposures (such as those often given during animal testing) might be carcinogenic simply because they overwhelm detoxification pathways. In these cases, we see tumors along with gross histopathologic evidence of tissue damage." However, dose regimens are defended by others (Bucher, 2000), and many substances test positive for carcinogenicity only at maximum tolerated doses, including some accepted human carcinogens. These results also might be interpreted as species differences that are hidden by high-dose artifacts at the expense of many false-positives.

Predictivity of point of reference (human cancer)
An analysis by Pritchard et al. (2003) suggested 69% predictivity of human carcinogenicity for the two-species cancer bioassay, which ironically dropped to 65% when it was combined with in vitro genotoxicity test findings. This contrasts with an analysis by Knight et al. (2006a,b), who showed that in 58% of the cases it considered, the EPA deemed results from a positive cancer bioassay insufficient for assigning human carcinogenicity, even though the EPA was far more likely to assign this classification than the IARC. A previous comparison of known human carcinogens, as classified by the IARC mainly based on epidemiology, with corresponding animal data found an unconvincing correlation (Freedman and Zeisel, 1988): "The research reports of the cancer community (even taken at face value) do not sustain the conventional argument for the validity of the qualitative extrapolation ... We remain sympathetic to the idea that animal data have some predictive value for carcinogenicity in humans ... But the evidence for such propositions is surprisingly weak." It is also worth noting that the most typical sites of tumor formation in humans do not correspond to those in rodents (Anisimov et al., 2005), as shown in Table 5.1: Ravenzwaay, 2010). It is estimated that $1-2 million and up to 1,000 mice over a 3-year period would be saved by eliminating the mouse section of each chemical test (Alden et al., 1996).

Reproducibility
Gottmann et al. (2001) compared 121 replicate rodent carcinogenicity assays from the two sections (National Cancer Institute/National Toxicology Program and literature) of the Carcinogenic Potency Database (CPDB) to estimate the reliability of these experiments. They found a concordance of 57% between the overall rodent carcinogenicity classifications from both sources; this result did not substantially improve when species, sex, strain, and target organ information was considered. They concluded: "These results indicate that rodent carcinogenicity assays are much less reproducible than previously expected, an effect that should be considered in the development of structure-activity relationship models and the risk assessment process." Ironically, cell transformation assays (CTA, discussed in more detail below) appear to reproduce the cancer bioassay better than it reproduces itself. Thus, it appears likely that the existing bioassay would fail any validation investigation that a replacement test would be subjected to.
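For readers less familiar with the metric, concordance here is simply the fraction of replicate pairs in which both experiments reach the same overall call. The following minimal sketch shows the calculation; the function name is hypothetical, and the example data are invented so that the arithmetic lands near the reported 57% (69 agreeing pairs out of 121 is an illustrative count, not a figure taken from Gottmann et al.).

```python
from typing import Sequence


def concordance(calls_a: Sequence[bool], calls_b: Sequence[bool]) -> float:
    """Fraction of paired experiments that give the same classification.

    Each position holds the overall call (True = carcinogenic, False = not)
    for the same chemical in two independent replicate assays.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both sequences must describe the same chemicals")
    if not calls_a:
        raise ValueError("at least one replicate pair is required")
    agreements = sum(a == b for a, b in zip(calls_a, calls_b))
    return agreements / len(calls_a)


if __name__ == "__main__":
    # Invented data: 121 replicate pairs, 69 of which agree (illustrative only).
    source_1 = [True] * 121
    source_2 = [True] * 69 + [False] * 52
    print(f"Concordance: {concordance(source_1, source_2):.1%}")
```

Note that two assays making calls entirely at random, each positive half of the time, would already agree in about 50% of cases, so a raw concordance of 57% sits close to what chance alone could produce under that simplifying assumption.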

Potency correlation between species
This is not a classical validation criterion, but it is part of the Bradford-Hill criteria to support associations. The apparent correlation between potency of carcinogens in mice and rats has been shown to be largely an artifact (Bernstein et al., 1985).

Interspecies and organ site correlation
Concordances of 57% were reported between mouse and rat bioassays. Better correlations that were previously reported (71% rat to mouse, 76% mouse to rat) were driven by the abundance of strong mutagens studied, which are typically positive in all sexes, many species, and several organs (Gray et al., 1995). An analysis of bioassays in rats, mice, and hamsters, as well as comparisons with humans for known carcinogens, has shown that the likelihood of a chemical that induces tumors in one species in a certain organ also inducing tumors in another species in the same organ is less than 50% (Gold et al., 1991, 1998).

Sex specificity
A critical appraisal of the role of sex hormones (endocrine status) on species susceptibilities in chemical carcinogenesis (Toth, 2002) concluded: "There are compelling indications, particularly in the fields of physiology and metabolism, to conclude the limited usefulness of the various animal species in sex hormone research. The findings allow only restricted inferences for the human species."

Scientific relevance
The first critical issue is that of high-dose to low-dose extrapolation. The use of maximum tolerated doses appears to be the source of many artifacts. Jay Goodman, Michigan State University, is cited (Schmidt, 2002) as saying: "If we're dealing with a situation in which the likely human exposure is in the same ballpark, then these (dosing regimens) may be applicable.  (Davies and Monro, 1995). It is not known how many chemicals were rejected over the same period (Davies and Monro, 1995). An early analysis of 20 putative human non-carcinogens found 19 false-positives, suggesting only 5% specificity (Ennever et al., 1987). The inappropriateness of rodent carcinogenicity assays as currently performed has been examined by Roe (1987), who notes that: "There can be no sense in testing chemicals for carcinogenicity in rats maintained under conditions such that 50-100% of them (the control animals) develop pituitary and mammary tumors, etc. There is no identifiable population of humans for which such rats could constitute a model." The implications of these observations for risk assessment have been noted by Bridges (Bridges, 1988). However, others see even this as an underestimate (Sobels, 1987): "... Carcinogenicity is expressed to a different extent in different species of rodents, so that bioassay results in only two rodent species are likely to underestimate the proportion of chemicals with carcinogenic potential."
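The 5% specificity quoted from Ennever et al. (1987) follows directly from the counts given in the text: of 20 putative human non-carcinogens, 19 tested positive and 1 tested negative. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the sensitivity helper is included only for symmetry and is not applied to any data from this paper.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of true non-carcinogens that the assay correctly calls negative."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no non-carcinogens were tested")
    return true_negatives / total


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of true carcinogens that the assay correctly calls positive."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no carcinogens were tested")
    return true_positives / total


if __name__ == "__main__":
    # Counts stated in the text: 20 putative non-carcinogens, 19 false-positives.
    tested = 20
    false_pos = 19
    true_neg = tested - false_pos
    print(f"Specificity: {specificity(true_neg, false_pos):.0%}")
```

With only 20 substances, the estimate is necessarily imprecise, but it illustrates why the false-positive rate dominates the discussion of the bioassay's scientific relevance.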

Sensitivity
Assessing the sensitivity of the cancer bioassay is made difficult by the fact that most human carcinogens were designated as such, to a large extent, by animal tests (with the discussed problematic interspecies correlation), and those typically missed are not identified by other means. There are strong claims that all known human carcinogens are detected with the cancer bioassay (Huff, 1999;Rall, 2000), but this could be considered a self-fulfilling prophecy, as most of these classifications are based on animal experiments. However, not all known human carcinogens can be modeled in animals (Silbergeld, 2004). For example, there is no animal model of cigarette smoke-induced lung cancer, no rodent leukemia induced by benzene, and no genetic point mutations in animals induced by arsenic. This situation does not necessarily represent a contradiction, as these agents are positive for carcinogenicity in other organs or by other modes of action. However, achieving the right classification but for the wrong reason is a questionable outcome. Furthermore, the current testing situation leads to an enormous number of false-positives; Rall suggests that only one in ten compounds is truly carcinogenic (Rall, 2000). Despite all of these false-positives, cases of human carcinogens that are not detectable in animals remain, e.g., the anticonvulsant diphenylhydantoin (phenytoin) is classified as carcinogenic to humans but showed no carcinogenic effect in experimental mice and rats (Anisimov et al., 2005). Ennever and Lave (2003) also have discussed the chemical combination of aspirin/phenacetin/caffeine, which is classified as a human carcinogen but tested negative in both rodent species.
Johnson (2001) presents a list of the known human carcinogens that have been tested in the NTP rat bioassay prior to 2000 (Johnson, 2001): "The list contains 10 different chemicals, counting the various forms of asbestos as one, the three nickel compounds as one, and the 10 benzidine-like compounds as one ... (Of) the 13 individual chemicals tested in four sex-species groups, two chemicals were positive in four groups, one was positive in three groups, six were
In the absence of human data, it might be considered reasonable to use data from tests in nonhuman primates for comparative purposes. Cancer bioassays in nonhuman primates were carried out on 37 compounds within 34 years (Takayama et al., 2008); the results were "... inconclusive in many cases," but carcinogenicity was shown unequivocally for four of them.

Specificity
About 50% of all chemicals tested in the cancer bioassay test positive (see Tab. 5.2), and 53% of 301 chemicals tested by the NTP were positive, with 40% of these positives classified as non-genotoxic (Ashby and Tennant, 1991). It is sometimes claimed that this high positive rate is due to the testing of suspicious substances, especially in early years of identification of mutagens. Of substances tested in the NTP simply because of exposure considerations, 80% were found not to be carcinogenic (Fung et al., 1995). In contrast, Johnson identified 60% of 128 high production volume chemicals to be rodent carcinogens (Johnson, 2003). A similarly high proportion, around 50% positives, can be found in various databases for pharmaceuticals (MacDonald, 2004).
Pharmaceuticals are rapidly discontinued when they are found to be possibly genotoxic, but many non-genotoxic ones also test positive in the cancer bioassay (Silva Lima and Van der Laan, 2000). "The database compiled from the 'Physician's Desk Reference' (PDR), including registered pharmaceuticals only, also provides a good illustration of rodent tumor findings being irrelevant to humans" (Davies and Monro, 1995;Silva Lima and Van der Laan, 2000). Over two decades, 101 out of 241 substances entered the market despite positive cancer bioassays, presumably primarily as a result of the positive bioassay
Tab. 5.2: Proportion of chemicals evaluated as carcinogenic (modified from Ames and Gold, 2000;Gold et al., 2005)
forts (Schmidt, 2002) resulted in a discussion of whether this is "bashing the cancer bioassay" (Johnson and Huff, 2002). While possibilities for improving the animal test are outside the scope of this paper, the discussion shows the difficulties of using the cancer bioassay as a point of comparison.
These comments, taken together, indicate that the cancer bioassay -although it has never been formally assessed -appears to have severe limitations. Furthermore, the assay would not stand up to the assessment criteria that any potential replacement test would have to fulfill. However, these limitations are not fully understood by many who use the assay for validation of alternative methods or regulatory purposes.
It appears timely to address these limitations before embarking on the expensive process of developing and validating replacement strategies that would only then be measured against this wrongly-considered "gold standard" test. It might be debated whether this represents a case for formal invalidation (Balls et al., 2006;Balls and Combes, 2005), but an approach based on the principles of evidence-based toxicology (Hartung, 2010c) seems to be more appropriate in this scenario than formal validation. A formal assessment of the assay would allow widespread dissemination -and encourage acceptance -of the evidence for the assay's limitations.
In line with these suggestions, the REACH guidance by ECHA is already quite cautious in its recommendations for use of the cancer bioassay (ECHA, 2008): "A carcinogenicity study may, on occasion, be justified. If there are clear suspicions that the substance may be carcinogenic, and available information (from both testing and non-testing data) are not conclusive in this, both in terms of hazard and potency, then the need for a carcinogenicity study should be explored. In particular, such a study may be required for substances with a widespread, dispersive use or for substances producing frequent or long-term human exposures. However, it should be considered only as a last resort."

Reduction to key events
This approach aims to replace in vivo testing with stand-alone alternative methods. Carcinogenicity traditionally is seen as the combination of genotoxicity leading to mutation and subsequent promotion of the mutation. The development of an in vitro genotoxicity test battery would be aimed at reducing these hazards to key events. This rather simplistic view, and the testing approach that would result, is unconvincing, as many mutagens are not carcinogenic and substances do not exist in isolation in real life; rather, people are exposed to complex mixtures of substances, including incomplete carcinogens, such that there are situations in which compounds that are either only genotoxic or only promoting complement one another. Furthermore, there is increasing evidence that many modifying factors influence organotropy, growth rate, metastasis, resistance to immune reactions and treatment, etc.
Within the traditional genotoxicity test battery, the Ames test is the best standalone predictor of the rodent bioassay, with about 60% sensitivity (Kirkland et al., 2005). Earlier, Gold et al. (1998) reported that out of 465 chemicals, 45% were found to
positive in two groups, one was positive in one group, and three were positive in no (0) groups. Only two human carcinogens (thiotepa and benzene) are bona fide trans-species carcinogens. Thus, for NTPRB-tested chemicals, it is not evident that human carcinogens necessarily demonstrate clear trans-species carcinogenic effects." These examples clearly contradict claims of 100% identification of known human carcinogens. It is also worth noting that an early assessment of the bioassay suggested only 46% sensitivity based on 19 human carcinogens (Salsburg, 1983).
The fact that rats and mice predict each other with only about 57% concordance does not fit with an assumption that 100% of human carcinogens are detected, as it is fair to assume that humans are not better predicted by either species than the two species predict each other. These figures of 57% concordance between species, 10% real human carcinogens, and 53% positives in the rat, combine to give a sensitivity of 100% with a specificity of 47%. Lave et al. previously arrived at an estimate of 70% sensitivity as well as specificity, assuming 10% real human carcinogens (Lave et al., 1988).
If the same calculation is performed with the assumption that 20% of all chemicals are carcinogenic in humans, this results in 75% sensitivity with 53% specificity. Interestingly, when we instead use the 28% positives suggested for the rat if non-suspicious chemicals are tested, the result is 0% sensitivity and 65% specificity. Thus, whatever assumption is used, the assay does not perform well by any standard.
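The arithmetic behind these estimates can be reconstructed as a two-by-two table. The sketch below is one possible reading, not necessarily the calculation the authors performed: it assumes that the 57% rat-mouse concordance can stand in for overall agreement between the rat result and the true human classification, and it solves for the true-positive share from the prevalence and the share of positives. The function name, the dictionary keys, and the clamping step are illustrative choices.

```python
def bioassay_performance(prevalence, positive_rate, concordance):
    """Back out sensitivity and specificity from three summary fractions.

    Assumption (not given as a formula in the text): concordance is read as
    overall agreement, i.e. true positives plus true negatives.
    """
    # agreement = TP + TN and TN = 1 - prevalence - positive_rate + TP,
    # so TP = (agreement - 1 + prevalence + positive_rate) / 2.
    tp = (concordance - 1.0 + prevalence + positive_rate) / 2.0
    # Clamp to the feasible range; if clamping changes the value, the three
    # inputs cannot all hold at once and the result is only a boundary case.
    tp = max(0.0, min(tp, prevalence, positive_rate))
    tn = 1.0 - prevalence - positive_rate + tp
    return {
        "sensitivity": tp / prevalence,
        "specificity": tn / (1.0 - prevalence),
        "true_negative_share": tn,
    }


# Scenarios taken from the figures quoted in the text.
scenarios = {
    "10% carcinogens, 53% positives": (0.10, 0.53, 0.57),
    "20% carcinogens, 53% positives": (0.20, 0.53, 0.57),
    "10% carcinogens, 28% positives": (0.10, 0.28, 0.57),
}
for label, args in scenarios.items():
    result = bioassay_performance(*args)
    print(label, {key: round(value, 2) for key, value in result.items()})
```

Under this reading, the three scenarios give sensitivities of 100%, 75% and 0%, matching the text. The specificities come out at roughly 52%, 53% and 69%; in the first scenario 0.47 is the true-negative share of all chemicals, so the 47% and 65% quoted above presumably rest on a slightly different convention.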
In a telling modeling exercise, Gaylor (2005) showed that increasing the number of animals per group from 50 to 200 would result in statistically significant (p<0.01) dose-responses for 92% of substances tested (Gaylor, 2005). This shows how the inherent variability of the test produces false-positives and reduces specificity using the current data analysis process.

Applicability domain
An applicability domain, i.e., the part of the chemical universe where the cancer bioassay is applicable to make sufficiently correct predictions, has not been established for the rodent cancer bioassay. Occasional reference is made to a better prediction of (strong) genotoxic substances, but these substances are exactly the ones filtered out by the in vitro genotoxicity testing battery and are unlikely to be tested in the bioassay.

Performance standards
Performance standards have been introduced for test methods as a guide to demonstrating that a given variant of a test is equivalent to the originally validated test. No such performance standards exist for the bioassay, although they would be very helpful for evaluation of alternative test methods. Bucher reported a discordance rate of 13/38 for the transgenic approach (Bucher, 1998), or a level of agreement of 68%, which barely differs from the 65% shown by the Salmonella mutation test. In response, Johnson showed the fingerprint pattern of organ sites affected (Johnson, 1999), concluding "... It seems unlikely that transgenic models could ever replicate or faithfully emulate the carcinogenic response observed in natural whole animals." More extensive evaluations were conducted by an ILSI/HESI committee (Cohen et al., 2001). An article on these ef-
included in a ring trial is extremely small, and the complexity and duration of the protocol result in transferability and reproducibility issues. A feasibility study conducted to assess some aspects of reproducibility showed that the CTA could be used for decision making when combined with retrospective assessments of its predictive value, as made possible by a modular approach to validation. This points to the need to transition to novel types of objective assessments (Hartung, 2010c). However, such retrospective analysis of existing data requires a level of transparency of the process that typically is not provided.
The CTA represents a prime opportunity to replicate results of the traditional animal-based approach by reducing the tests to a key event. A recent evaluation based on 141 studies showed that the SHE-7 variant of CTA predictions of rodent carcinogenicity gave a sensitivity of 88%, specificity 77%, accuracy 85%, positive predictivity 89%, and negative predictivity 75% (Benigni and Bossa, 2011). More importantly, the detailed review paper of OECD 2006 indicated for the CTA a sensitivity of 90% of class I (known human carcinogens) and 95% of class II (possible/probable human carcinogens) (Long, 2007;OECD, 2007).
The CTA can and must undergo further optimization with regard to:
- transition to human cells
- addition of metabolic competence
- automation, especially of foci reading
- possible transition to biomarkers of cell transformation measurements
- statistics (Ponti et al., 2007)
The CTA also represents an interesting opportunity for pathway of toxicity (PoT) mapping, as discussed below.
A focus on key events, as just described, could also be applied to non-genotoxic mechanisms of carcinogenicity, such as immunosuppression, inflammation, and hormonal activity (Tab. 5.3). This corresponds to some extent with the type of information currently considered in weight of evidence approaches, especially in REACH, but it is more likely to form the basis
be mutagens by the Ames test, 63% were carcinogens, and 72% were either, i.e., 79% of mutagens were carcinogens, but 43% of carcinogens were not mutagens, and 25% of the non-carcinogens were mutagens. When genotoxicity assays are combined, sensitivity inevitably increases but specificity falls: When all three tests were performed, 75-95% of non-carcinogens gave positive (i.e., false-positive) results in at least one test in the battery (Kirkland et al., 2005). For marketed drugs (which typically exclude substances found to be genotoxic during development), no particularly strong concordances were seen between the 29% positive for genotoxicity and the 38% with positive or equivocal findings in the cancer bioassay (Snyder and Green, 2001). These results raise strong questions as to whether such tests, alone or in combination, can really help to determine the carcinogenic potential of substances.
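The conditional percentages attributed to Gold et al. follow from the three shares quoted above (45% mutagens, 63% carcinogens, 72% either) by inclusion-exclusion. The sketch below reconstructs that step; the function name and dictionary keys are illustrative, and the inputs are the rounded percentages from the text rather than the underlying counts.

```python
def ames_cross_tab(mutagen, carcinogen, either):
    """Derive conditional fractions from three shares of all chemicals (0..1)."""
    # Inclusion-exclusion: share that is both mutagenic and carcinogenic.
    both = mutagen + carcinogen - either
    non_carcinogen = 1.0 - carcinogen
    return {
        "both": both,
        "carcinogens_among_mutagens": both / mutagen,
        "non_mutagens_among_carcinogens": (carcinogen - both) / carcinogen,
        "mutagens_among_non_carcinogens": (mutagen - both) / non_carcinogen,
    }


shares = ames_cross_tab(mutagen=0.45, carcinogen=0.63, either=0.72)
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
```

With the rounded inputs this gives about 80%, 43% and 24% for the three conditional fractions, close to the 79%, 43% and 25% reported; the small differences are presumably rounding in the published percentages.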
Cell transformation Sachs, 1963, 1965), i.e., lack of growth inhibition in the case of confluency, has been suggested as a key event reflecting mutagenicity and some initial effects on cell replication, reduced apoptosis, DNA repair, oncogene activation, suppressor gene inactivation, and epigenetic effects. The value of these assays has been long discussed (Combes et al., 1999), leading to parallel test guideline development at OECD and prevalidation at ECVAM (Vanparys et al., 2011). A key piece of the validation exercise was the detailed review paper (DRP) prepared by OECD summarizing existing experience with the assay and some additions to this dataset (Mascolo et al., 2010;Mazzotti et al., 2002). The data presented in the DRP were considered at an OECD Expert Consultation Meeting in 2006. Overall, it was concluded that the SHE and BALB/c 3T3 assays had a strong ability to detect rodent carcinogens, with a good positive and negative predictive capacity and sensitivities and specificities in the 80% range. Unfortunately, detailed information on the validation, its peer review by ECVAM's Scientific Advisory Committee (ESAC) (with only the statement now available for public discussion), and the subsequent conclusions of OECD are still not available (according to personal communications, it was decided at the end of 2011 to proceed with the OECD guideline for the SHE assay but not for the BALB/c 3T3 assay), although a recent review sheds some light. A parallel Japanese validation study on an improved assay has been published in the meantime (Sakai et al., 2011).
Perhaps the greatest concern, however, relates to the lack of understanding of the mechanisms by which CTAs operate (Ashby, 1997;Farmer, 2002). It is puzzling, for example, how the 3T3 variant of the assay, which has limited metabolizing capacity (Colacci et al., 2011), can so well reflect in vivo carcinogenicity in rats, while activation of xenobiotics to form reactive substances is considered a key event for genotoxicity. The question, therefore, might be turned around: given the high false-positive rate of the cancer bioassay (as discussed below), does the CTA actually reflect the false-positives of an organism overwhelmed with maximum tolerated doses, where metabolism contributes little more?
The CTA validation shows the limitations of traditional validation studies. With costs of about € 15,000 per substance tested in one laboratory, the number of chemicals that can be

CYP450 induction
This approach is meant to be precautionary, but is it sufficiently accurate? Certainly, we have to call it prohibitive, as it excludes large parts of the chemical universe from many uses. A debate intended to improve genotoxicity testing has started (Goodman et al., 2007;Lorge, 2009), with the aim of also incorporating Tox-21c approaches and new technologies (Elespuru et al., 2009).
Other properties and tests might exclude carcinogenic potential. The concept of "no penetration, no harm" offers some opportunities, for example. Large molecular weight typically is accepted as an indication of no harm, although fiber toxicity, as seen with asbestos and now, increasingly, with nanoparticle toxicology, might challenge this (Hartung, 2010e;Hartung and Sabbioni, 2011). The major problem in this approach is its reliance on negative data (no uptake). This concept is further refined by the threshold of toxicological concern (TTC) approach, where exposure in sufficient quantity, rather than bioavailability
of integrated testing strategies (ITS) than stand-alone replacements. Notably, REACH guidance by ECHA lists a number of in vitro tests that add weight of evidence (Tab. 5.4).

Negative exclusion by lack of key property
The current use of the in vitro genotoxicity battery follows a negative exclusion approach, i.e., substances showing no genotoxic potential are considered of low carcinogenic potential. The limitations of this approach, namely a high false-positive rate of the combined use of these assays, are well known (Blakey et al., 2008;Kirkland et al., 2007) and addressed elsewhere (Benfenati et al., 2009;Kirkland et al., 2007;Pfuhler et al., 2009, 2010a). It appears that the cancer bioassay produces far too many false-positive results when compared to human hazards, the mutagenicity testing in vivo further adds genotoxic substances that are not carcinogens, and this is further aggravated by the over-predictive in vitro battery.
hormone- or other receptor binding: a number of agents may act through binding to hormone receptors or sites for regulatory substances that modulate the growth of cells and/or control the expression of genes that facilitate the growth of neoplastic cells.

Tab. 5.4: In vitro tests adding weight of evidence for carcinogenicity assessments according to REACH guidance by
Interactions of this nature are diverse and generally very compound specific.

Apoptosis is integral to the control of cell growth and differentiation in many tissues. Induction of apoptosis can eliminate cells that might otherwise suppress the growth of neoplastic cells; inhibition of apoptosis can permit pre-neoplastic/neoplastic cells to escape regulatory controls that might otherwise result in their elimination.
ability to stimulate angiogenesis or the secretion of angiogenesis factors: the growth of pre-neoplastic/neoplastic cells in solid tumours will be constrained in the absence of vascularisation to support the nutritional requirements of tumour growth.

Secretion of angiogenesis factors stimulates the vascularisation of solid tumour tissue and enables continued tumour growth."
optimization might be very different: for an ITS, the goal is not necessarily the best prediction or highest sensitivity for each test component, but value added in complementing the other blocks of the strategy. Many earlier developments might need to be revised when tests are now considered for ITS instead of standalone applications.
A number of strategies might be able to improve the predictive value of existing test systems:
- extension of metabolic capacity
- organotypic 3-dimensional (co)-cultures
- more physiologic culture conditions such as homeostasis, oxygen supply, cell density
- transition from cell lines to primary cells or stem cell-derived systems
- use of human cells, preferably primary cells and possibly a battery of different human cell types, ideally derived from stem cells
- human cells that have wildtype p53 and are DNA-repair competent
- use of genetically stable cells
- refinement and expansion of endpoints measured
- restriction of maximally used concentrations
- standardization and automation
- quality assurance of procedures
- appropriate statistics and prediction models
- definition of applicability domains
- better understanding of the mechanism of action
- knowing the weaknesses and strengths of systems for the development of new models

In silico approaches
Due to the enormous costs involved, public interest, and the availability of in vivo data (especially from the NTP), carcinogenicity testing has been subjected to intense in silico modeling. For carcinogenicity prediction, however, the use of these models is rather limited. Benigni and Bossa summarized (Benigni and Bossa, 2006) We think the answer is at best 'to some extent'. The models used are not overly realistic for the purpose of data description, because they ignore essential processes." And (Benigni, 2004): "Study of the structure of the chemicals generates predictions with limited reliability for the individual chemicals" but Benigni sees enormous value for priority setting for testing. A key prerequisite for improving the available models will be to generate larger homogeneous datasets for modeling (Patlewicz et al., 2003). An impressive discrepancy currently exists between studies employing external evaluations, such as the Predictive Toxicology Challenge (PTC), and internal validation results. For the PTC a training set of 509 compounds from the NTP was used, with results for carcinogenic effects (Helma and Kramer, 2003).
or no bioavailability, is employed. Exposure issues, as formalized in TTC approaches, are limited by the no-threshold philosophy for cancer hazards, which is under continual discussion (Calabrese, 2009;Crebelli, 2000;Kirsch-Volders et al., 2000;Lutz, 2000;Morelli, 2000;Neumann, 2009;Rhomberg, 2011). However, for some genotoxic mechanisms, such as aneugenic activity leading to tumors, thresholds are already accepted. It is also important to note that the bioavailability of a compound to cells in in vitro culture is often much higher than its bioavailability in tissue.
Ironically, the non-threshold concept and its broad acceptance might have been flawed from the beginning (Calabrese, 2011). It appears that this is actually an example of unsuitable statistics, where deterministic calculations are used to handle rare events (carcinogenic effects at very low concentrations); a switch to probabilistic methods (see Chapter 1) might resolve this. For a lay audience, Taleb has explained the concept in his bestseller The Black Swan (Nassim, 2010).
Notably, this testing is applied very differently in different sectors, allowing for example, TTC approaches for food contaminants (Barlow and Schlatter, 2010;Kroes et al., 2004, 2005;Munro et al., 2008;O'Brien et al., 2006;Pratt et al., 2009), or margin of exposure (MOE) approaches (Benford et al., 2010), in which differences between actual human exposure and the point of departure of toxicity in animal experiments are used. Some authors have an alternative way of stating this discrepancy: "Analysis also indicates that many ordinary foods would not pass the regulatory criteria used for synthetic chemicals" (Ames and Gold, 2000;Silva Lima and Van der Laan, 2000). As an extreme example, two of the authors have shown that using the same regulatory approach for alcohol as for TCDD (dioxin) based on carcinogenic potential in rodents would allow a person to drink one beer in 345 years.
A similarly pragmatic approach allowing TTC would be likely to help in a large number of cases for cosmetics and other consumer products, without even the need to introduce new test approaches (Kroes et al., 2007). It should be noted that recent refinements include genotoxicity data, thus bridging to actual test data (Felter et al., 2009). The question that arises is how this approach can be validated. Indeed, this might actually be a case that is better suited to an evidence-based toxicology (EBT) evaluation (Hartung, 2010c) than a prospective ring trial.
Many toxic endpoints, especially in genotoxicity, rely on reactive chemistry allowing interaction with target structures -the absence of structural features allowing direct reactivity or activation via metabolism represents another example of exclusion of a hazard. Rather simple approaches give valuable information (Pelkonen et al., 2009), but it seems unlikely that this is sufficient for exclusion of a hazard. Still, assays like the peptide reactivity assay might be explored as to their predictive value for carcinogenicity.

Optimization of tests
Both genotoxicity tests (Speit, 2009) and the CTA (Combes et al., 1999) leave room for optimization, as discussed in part earlier. This might improve their value as stand-alone tests as well as test blocks in an ITS. It is worth noting that the goals of such

Information-rich single tests
The advent of "omics" (genomics, proteomics, etc.), image analysis systems, and other high-content measurement systems has introduced new opportunities for pattern recognition: instead of choosing a more or less meaningful endpoint to represent the response of a biological system, a multitude of signals can be recorded, hopefully including meaningful ones among many meaningless ones. The art lies in filtering the former, but the availability of bioinformatics tools for this purpose is increasing. The advantage of this approach is that the most informative biomarkers can be chosen in an unbiased way, independent of our initial understanding of a system. These can be individual endpoints as well as patterns, which we call "signatures of toxicity" (SoT). When combined with existing biological knowledge, such as our understanding of pathways from biochemistry, cell physiology, molecular biology or toxicology, these signatures can ultimately be translated for assessing perturbations of the living system, i.e., using a systems biology approach. At the level of signatures, this is a simple correlative approach with many limitations, including that epiphenomena cannot be distinguished from causal factors. For example, repair responses will correlate with damage, but obviously do not cause it. As a result, early response genes are typically seen when transcriptomics is used in toxicology. These approaches bear the risk of being non-specific and uninformative about the mode of action. However, some of these limitations might be overcome as our understanding of pathways of toxicity increases (see below). As Adler et al. (2011) note: "The mechanisms by which non-genotoxic carcinogens cause tumors are in most cases related to tissue-and species-specific disturbances in normal physiological control, gene expression patterns implicated in cellular proliferation, survival, and differentiation (Baylin and Ohm, 2006;Esteller, 2007;Widschwendter and Jones, 2002)." 
This is almost a definition of systems toxicology.
For carcinogenicity, in vitro transcriptomics approaches are emerging (Guyton et al., 2009;Jacobs, 2009;van Kesteren et al., 2011;Vinken et al., 2008). These have been applied initially to genotoxic carcinogens, but the approach makes just as much sense for the non-genotoxic mechanisms, and there are early indications that genotoxic and non-genotoxic effects can be discriminated (Magkoufopoulou et al., 2011;Plant, 2008). However, others have found that not even genotoxic carcinogens that do not function via DNA adducts can be identified (Benigni et al., 2010), although this identification seems to be possible in short-term animal tests (Fielden et al., 2008, 2011;Waters et al., 2010).

Integrated testing strategies (ITS)
Our understanding of chemical carcinogenesis is continuously improving (Cohen and Arnold, 2011). This means a comprehensive testing strategy, designed to complement an optimized genotoxicity testing toolbox, must integrate more and more mechanisms and modes of action. There are already suggestions, however, for how to generate ITS for genotoxicity (Pfuhler et al., 2010b;Aldenberg and Jaworska, 2010).
A relatively simple ITS combining the Ames test with the CTA for Ames-negative substances resulted in impressive predictions of the cancer bioassay and reduced in vivo testing needs
The US FDA used a test set with data from 185 substances. Fourteen groups submitted 111 models, but only five were better than random at a significance level of p=0.05, with accuracies of predictions between 25 and 79% (Toivonen et al., 2003). Two previous comparative exercises by the NTP had challenged models with 44 and 30 chemicals prospectively, i.e., with chemicals which were to be tested only (Benigni and Giuliani, 2003). The accuracy of in silico predictions in the first attempt was in the range of 50-65%, while the biological approaches attained 75%. The results in the second attempt (Benigni and Zito, 2004) ranged from 25 to 64%. In remarkable contrast, mere internal validations can show results of 75-89% predictivity for carcinogenicity (Matthews et al., 2006;Julien et al., 2004).
It is worth documenting that, although REACH guidance is overtly positive about (Q)SAR in respect to other tests, it reserves a reluctant tone for discussing the use of (Q)SARs in carcinogenicity testing: "The capacity for performing the standard rodent cancer bioassay is limited by economic, technical, and animal welfare considerations, such that an increased emphasis is being placed on the development of alternative, non-animal testing methods. However, carcinogenicity predictions through use of non-testing data currently represent an extreme challenge due to the multitude of possible mechanisms. Prediction of carcinogenicity in humans is especially problematic." However, it is important not to dismiss in silico options. While the REACH assessment is rather skeptical with regard to standalone in silico solutions, they have broad applicability in ITS or when combining in vitro and in silico techniques for a standalone test. As a recent consensus report concluded (Benfenati et al., 2009): "In silico methods can be used for priority setting, mechanistic studies, and to estimate potency. Ultimately, such efforts should lead to improvements in application of in silico methods for predicting carcinogenicity to assist industry and regulators and to enhance protection of public health." A worthy summary of the situation was given as early as 1994 by John Ashby: "The accurate prediction of chemical carcinogenicity can only be achieved by a balanced consideration of the following factors: the chemistry and metabolism of the test agent, the interaction between toxicity and genetic toxicity, the possibility of non-genotoxic events that trigger subsequent non-targeted mutagenesis, the difference between activities observed in vitro and in vivo, and the possible inadequacy and/or partiality of all datasets and observations. 
Extrapolation of activities within a series of congeners is usually possible, but predictions across different chemical classes/ mechanisms of carcinogenicity are difficult. Artificial intelligence systems can be used to predict one or more of the above parameters given adequate learning sets, but the hope for a single, coherent and self-contained method of predicting all instances of carcinogenicity is unreal. The future of carcinogen/ mutagen prediction lies with data-rich artificial intelligence systems based on known mechanistic principles used selectively within the context of chemical and biological human insight. The major current obstacle to progress is the assumption that mutagenicity and carcinogenicity are unitary phenomena that can be learned and predicted by artificial intelligence systems operating in isolation."

Specificity
In contrast to genotoxic compounds, which usually result in tumor development in several organs of the same animal species and even in several animal species, non-genotoxic carcinogens might be more specific with respect to their tumorigenic potential, as they frequently induce tumors in only mice or rats, one sex, and, in most cases, in one or few organs.

Existence of a threshold
Often tumorigenic effects only occur when high doses of a compound are used in order to produce prolonged interference with normal physiological control and modifications of cellular proliferation patterns, i.e., threshold doses exist. In addition to the identification of the existence of a threshold, it is crucial to assess the mechanism by which high doses exert a carcinogenic effect.

Reversibility
Tumorigenic/carcinogenic effects of non-genotoxic substances are observed only when a compound is continuously applied over extended periods and may be at least partially reversed after administration of the compound is discontinued.
by 90% (Benigni and Bossa, 2011) and the number of CTA by almost 50%. Even more tests were avoided (95%) when structural alerts were combined with the CTA. Several ITS have been proposed, but their composition has been based primarily on the expertise and opinion of their respective proponents, as no principles for ITS composition are available (Cohen, 2004;Combes et al., 2007), see Figures 5.4 and 5.5. There is already an ITS suggested in the ECHA guidance to industry (Fig. 5.6).
Lave and Omenn started to model the combination of the cancer bioassay with a screening test as early as 1988 (Lave et al., 1988). In addition to sensitivity and specificity, the prevalence of the hazard among the substances studied is key for such calculations. Taking into account the societal costs of misclassification, they suggest that the screening test employed must be either the most accurate or the least expensive.
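The role of prevalence in such calculations can be illustrated with Bayes' rule. The sketch below is a generic calculation, not the model of Lave et al.; it uses the 70% sensitivity and specificity attributed to them earlier in this chapter, and the range of prevalences is an illustrative assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test characteristics, different shares of true carcinogens.
for prevalence in (0.05, 0.10, 0.20, 0.50):
    ppv, npv = predictive_values(0.70, 0.70, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

At 10% prevalence, a test with 70% sensitivity and specificity yields a positive predictive value of about 21%: roughly four out of five positive results would be false alarms, even though about 95% of negative results would be correct.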

Specific considerations for non-genotoxic carcinogens
Although different mechanisms may be involved in the carcinogenic action of non-genotoxic compounds, several common characteristics may be defined (Silva Lima and Van der Laan, 2000):
Cohen, 2004) Each box poses an evaluation to be performed. If the sequence results ultimately in a NO that is in a circle, there is no (or negligible) carcinogenic risk in humans. If the sequence results ultimately in a YES that is in a square, it poses a presumptive carcinogenic risk.
These characteristics suggest that different mechanisms are involved and call for complementing the genotoxic test battery with assays that address pertinent non-genotoxic mechanisms (Tab. 5.4).
For example, literature surveys showed that 38 out of 48 endocrine-disrupting chemicals (79%) studied were positive in at least one cancer bioassay (Choi et al., 2004; Dietrich, 2010). A number of endocrine disruptor assays have been developed for the respective screening programs, and some of them have even been validated and accepted at OECD level. Mode of action based tests also are available for many other mechanisms, or they can be easily adapted from tests designed for other purposes; for example, inflammation represents another non-genotoxic mechanism (Emmendoerffer et al., 2000; Ohshima et al., 2003) for which ample in vitro testing systems are available. The same holds true especially for immunosuppression (Carfi et al., 2007; Galbiati et al., 2010; Gennari et al., 2005; Langezaal et al., 2001; Lankveld et al., 2010) but also for chronic cell injury, increased secretion of trophic hormones, oxidative stress, and reactive oxygen and nitrogen species (ROS and RNS). An assay for CYP450 induction in cryopreserved human hepatocytes also is currently under prevalidation (Abadie-Viollon et al., 2010; Richert et al., 2010). Although in vitro tests are already available, and appropriate, for a number of non-genotoxic mechanisms, other non-genotoxic mechanisms that may contribute to carcinogenicity are much more difficult to assess in vitro, such as immune surveillance of cancer cells.
The cancer bioassay is a "one size fits all" assay and is, by definition, problematic, as a testing assay can be either specific or sensitive, but not both. We believe that an alternative to the bioassay, in the form of an integrated testing strategy (ITS) using animal-free tests, which is then subjected to probabilistic risk assessment, would provide better information on the carcinogenic risks of new and existing chemicals to regulatory agencies and the public. In this chapter, we present a roadmap for how this might be achieved.
We start with a summary analysis of the cancer bioassay, but we stress that this is by no means a complete and objective assessment of the assay. Indeed, our primary conclusion from this paper is a strong recommendation that such an assessment be carried out. Only when this is objectively conducted will it be possible to move forward effectively in the development of alternative testing strategies.
We continue with an analysis of the assessment framework presented in Chapter 1 with respect to carcinogenicity. We do not believe that it will be possible to find a standalone, single in vitro assay for carcinogenicity testing, and we consider reduction of carcinogenicity to a key event, or negative exclusion by lack of a key property, too simplistic. While the cell transformation assay (CTA) provides a surprisingly high reproducibility of results compared to the bioassay, these findings should be considered with caution, and we feel the CTA needs further evaluation.
Our vision for an alternative to the cancer bioassay is an ITS that uses a combination of in vitro and in silico techniques to assess both genotoxic and non-genotoxic carcinogenicity mechanisms. Furthermore, we suggest that testing should be separated from risk analysis, and that the latter should be done in a probabilistic, rather than deterministic, manner. Finally, we feel strongly that genotoxicity and carcinogenicity pathways of toxicity (PoT) should be investigated as part of the newly established Human Toxome project, and this will feed new information into the carcinogenicity ITS.
The complexity of potential targets and interactions for systemic hazards prompted the use of whole animal test models to mirror as many of these targets and interactions as possible. We increasingly understand, however, that these tests inevitably bring with them a high number of differences in these targets and their interactions. As far as available data can determine, the concordance between animal species in the cancer bioassay is no better than 57% (rats versus mice), and there is no reason to assume that any of them predicts humans better than they predict each other. Reproducibility issues, small group sizes, and poor statistics further limit the reliability of these assays. With the express purpose of erring on the side of safety, animal models have been rendered more sensitive (precautionary) by high-dose testing, with an overemphasis on any positive (i.e., toxicity) findings and the two-species paradigm.
This situation results in two major problems:
- There is no way to model the complexity of the hazard with simple systems.
- The results (where available) from animal tests as such do not qualify to validate novel approaches against them.
Jaworska and Hoffmann have defined a framework for ITS that will inform toxicological decisions in a systematic, transparent, and consistent way (Jaworska and Hoffmann, 2010). They reviewed conceptual requirements for ITS and presented a roadmap to an operational framework that should be probabilistic, hypothesis-driven, and adaptive, as well as outlining properties an ITS should have in order to meet the identified requirements and differentiate them from evidence synthesis. We strongly recommend that an ITS framework along these lines should be applied to a battery of mode-of-action tests. An example of this process in the context of sensitization testing was recently published (Jaworska et al., 2011).

Pathways of Toxicity (PoT) and systems toxicology
The concept of PoT, as detailed in Chapter 1, is being pioneered in the EPA ToxCast project. Phase 1, focusing mainly on pesticides and off-the-shelf available pathway assays in HTS platforms, has delivered impressive first results supporting the concept of PoT for carcinogenicity and genotoxicity as well (Knight et al., 2009;Martin et al., 2009a). The current expansion to more substances and substance classes, as well as PoT assays, represents a prime opportunity to explore this approach. This approach, however, is limited by the use of known PoT and available tests. Unsupervised identification of PoT by mapping the human toxome is the logical complement to this approach. We strongly recommend that a genotoxicity and carcinogenicity branch of this activity be developed. In both cases there are (pre-)validated tests, as discussed above, and human-relevant reference substances available. This process will lead to new approaches in carcinogenicity testing, especially when it is combined with the HTS approaches of ToxCast using similar substance sets.
Studies of cancer biology already have identified 12 signaling networks that are important in oncogenesis. Almost all cancers show perturbations in molecules in one or more of these pathways. Although these networks do not represent specific PoT, they may be useful starting points from which to look for biomarkers and identify potentially carcinogenic PoT. However, it is important to note that these pathways, as they are understood at the moment, are not necessarily predictive of carcinogenicity, as perturbations in the pathways often arise only as a later outcome of a mutagenic effect. Also, some non-genotoxic carcinogenic effects, such as immune surveillance escape, may not implicate one of these 12 pathways.

Conclusions and recommendations: carcinogenicity
The cancer bioassay is a two-year test conducted in rats or mice and is currently the only accepted test for carcinogenicity. Testing a single chemical compound using the cancer bioassay requires the use of at least 600 animals and costs approximately €1 million, yet the assay is estimated to have a concordance of only 57% between rats and mice and to predict 9 "innocent" chemicals as being carcinogenic for each one it correctly identifies.
It is not realistic that any in vitro or in silico tool at this stage can be fully predictive of a human systemic toxicity. The questions that must be addressed, however, are how close can we come to this and how can we get closer to achieving that goal, especially by combining multiple approaches. In vivo alternatives to the bioassay do exist (especially the shorter assays in transgenic animals or when replacing the bioassay with a genotoxicity testing battery combined with a 28- or 90-day animal test), but they are beyond the scope of this paper.

Recommendations: carcinogenicity
The major recommendations from this report are:
1. It appears that the cancer bioassay has severe limitations. Assorted data as to its validity are available, but many of the analyses are relatively old. An objective evaluation of the test using EBT approaches is warranted. This will document the limitations of the assay and allow a more critical assessment of when, and indeed whether, it should be used. A general feeling in the expert panel was that the assay qualifies for invalidation. A better understanding of the assay's limitations also will be informative for interpretation of assay results in cases where it is still used. We also feel that an objective evaluation will provide a helpful impetus for the search for alternative approaches.
2. It is clear that the cancer bioassay must not be the point of reference for validation exercises in future approaches. While the assay may continue to be used in some cases until alternatives are available, these alternatives must not be compared to the bioassay.
3. The CTA and the genotoxicity test battery merit further optimization and evaluation in order to positively or negatively filter substances for carcinogenic potential. Larger datasets will also benefit modeling attempts. Although some evaluations of the CTA have shown it to be a useful alternative to the bioassay, these should be treated with caution, as it is difficult to understand how a CTA assay can reproduce animal tests better than animal tests reproduce themselves. To have the CTAs better accepted, it would be good to have the applicability domain (chemical classes, etc.) retrospectively determined on the basis of the information in the detailed review document (DRP) prepared by the OECD. Furthermore, its predictivity of human carcinogens should also be addressed. OECD is currently planning a further review of the CTA, and the findings of this process should be carefully considered during the development of an ITS as an alternative to the bioassay. The suggested evaluation of the bioassay will have important implications for this review of the CTA assays.
4. Such optimization should include the combination with high-content measures, in silico analysis, and automation for HTS.
5. Carcinogenicity qualifies for ITS development with a number of assays representing non-genotoxic mechanisms lending themselves to evaluation. A "CarcinoTect" evaluation, in a similar manner to the ReProTect process that has been conducted for reproductive toxicology, may be a good starting point for the development of a carcinogenicity ITS.
6. With (pre-)validated cell systems and ample reference compounds, especially from the IARC process, PoT identification represents a key priority to accelerate Tox-21c. PoT identification requires the complement of probabilistic condensation of the information generated.

A Roadmap for the Development of Alternative (Non-Animal) Methods for Reproductive Toxicity Testing
Developmental and reproductive toxicity was not in the foreground of safety assessments for many years after the shock of the thalidomide disaster (Kim and Scialli, 2011) had died down. More recently, the European REACH legislation, which is extremely demanding in this field (Breithaupt, 2006; Hartung and Rovida, 2009a; Rovida and Hartung, 2009; van der Jagt et al., 2004; Rovida, 2010), has stirred discussion again, notably because tests like the two-generation study are among the most costly and require up to 3,200 animals (two-generation study) per substance. Another driving force is the European ban on testing for cosmetics ingredients (Hartung, 2008a). A series of activities by ECVAM, including several workshops, have tackled this challenge and will be condensed here. The Integrated Project ReProTect (Hareng et al., 2005) was one of its offspring, pioneering several alternative approaches.
Reproductive toxicity aims to assess possible hazard to the reproductive cycle, with certain emphasis on embryotoxicity. Only 2-5% of birth defects can be associated with chemical and physical stress (Mattison, 2010). This includes mainly the abuse of alcohol and other drugs. For the assessment of the prevalence of effects on mammalian fertility, the available database is even more limited.
This roadmap paper also has benefitted from the recent discussions, including the recent detailed analysis of the 2013 marketing ban for cosmetic ingredient testing in Europe (Adler et al., 2011; Mattison, 2010). In addition, some activities under the auspices of ILSI/HESI and the US ToxCast project have helped to clarify opportunities and challenges. This paper will not always distinguish clearly between developmental and reproductive toxicity, simply considering developmental effects (teratogenicity) as the key concern within reproductive toxicity (which obviously also includes aspects of fertility and other impairments of the reproductive cycle). Developmental processes are especially difficult to assess (Knudsen et al., 2011), as the timing of processes creates windows of vulnerability, the process is especially sensitive to genetic errors and environmental disruptions, simple lesions can lead to complex phenotypes (and vice versa), and maternal effects can have an impact at all stages.

high-dosage, effect-driven use, while chemicals, if at all, will typically affect the human body in a low-dose, long-term manner. Therefore, adapting the risk assessment of pharmaceuticals to chemical effects might not be appropriate. Despite that, the latter approach was introduced for chemicals several decades ago, but it held true only for new chemicals at a certain production volume. Very few new chemicals, however, are produced in high enough volumes to trigger such testing. Thus, experience with the predictive value and performance in general for ordinary chemicals is more than limited. So are the laboratory capacities available to carry out testing. Bremer et al. (2007) showed that in both the New Chemicals Database and the US EPA HPV database, any given reproductive toxicity test has been used for less than 3% of the notified substances (Bremer et al., 2007a) (Fig. 6.2).
Fleischer has demonstrated the limited testing facilities and a lack of sufficient scientific/technical know-how (Fleischer, 2007): A survey including 28 major independent and corporate laboratories in Europe indicated that only 11 offer two-generation studies with a capacity of 28 substances per year. This total suggests a capacity to carry out about 50 parallel, two-generation studies in Europe, each lasting about two years. Thus, every year 25 new substances can be included. The majority of this testing capacity is employed for drugs and pesticides. Only about three general chemicals per year have been tested in two-generation studies since the introduction of the Dangerous Substances Directive in 1981 (Fleischer, 2007). Thus, testing of hundreds or even thousands of chemicals in the context of REACH will overwhelm available test capacities. This calls for adequate prioritizing to make best use of these limited resources as well as for the use of any other means to satisfy the information requirement by way of an alternative and integrated testing strategy.
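The capacity argument drawn from the Fleischer survey can be illustrated with a short Python sketch. It assumes steady-state operation, in which a slot becomes free whenever a study finishes; the figures of 50 parallel studies and a two-year duration are those quoted above, while the function names and the backlog of 1,000 substances are hypothetical and chosen only for illustration.

```python
import math


def annual_throughput(parallel_studies, study_duration_years):
    """New studies that can be started per year at steady state."""
    return parallel_studies / study_duration_years


def years_to_start_all(backlog, parallel_studies, study_duration_years):
    """Whole years needed until every substance in the backlog has entered a study."""
    per_year = annual_throughput(parallel_studies, study_duration_years)
    return math.ceil(backlog / per_year)


# 50 parallel studies of about two years each (figures quoted in the text)
print(annual_throughput(50, 2))

# Backlog of 1,000 substances is a hypothetical value for illustration only
print(years_to_start_all(1000, 50, 2))
```

The sketch ignores that most of the capacity is already used for drugs and pesticides, so the real time needed for general chemicals would be longer than this simple estimate suggests.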

Current testing
The treatment of one or more generations of rats or rabbits with a test chemical is the most common approach for identifying chemically induced adverse effects on reproduction (Fig. 6.1). For evaluating developmental toxicity, test guidelines were designed to detect malformations in the developing offspring, together with parameters such as growth alterations and prenatal mortality (Collins, 2006). Developmental toxicity tests are considered mainly as screening tests (especially for REACH). The shorter and less complex "screening" tests, which combine reproductive, developmental, and (optionally) repeated dose toxicity endpoints into a single study design, are variants.
As a result of these studies (Tab. 6.1), a No Observed Effect Level (NOEL) is determined. These data then are extrapolated from animal studies to humans. In this process, safety factors are applied. This safety factor is normally 100, i.e., 1% of the dose that did not cause any adverse effects is considered safe in humans (acceptable daily intake values). The value of 100 is a common default as a safety factor (based on the assumption that 10 is an estimate of interspecies and another 10 of intra-species differences), but justifiable deviations are possible in both directions.
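The extrapolation described above amounts to dividing the NOEL by the product of the two default factors. A minimal Python sketch follows; the function name, the parameter names, and the example NOEL of 50 mg/kg/day are assumptions made for illustration and do not come from any guideline.

```python
def acceptable_daily_intake(noel, interspecies_factor=10.0, intraspecies_factor=10.0):
    """Divide the NOEL by the combined safety factor (default 10 x 10 = 100)."""
    if noel < 0 or interspecies_factor <= 0 or intraspecies_factor <= 0:
        raise ValueError("NOEL must be non-negative and factors must be positive")
    return noel / (interspecies_factor * intraspecies_factor)


# Hypothetical NOEL of 50 mg/kg body weight/day from an animal study
noel_mg_per_kg_day = 50.0
print(acceptable_daily_intake(noel_mg_per_kg_day))
```

Keeping the two factors as separate parameters mirrors the text's point that justified deviations from the default of 100 are possible in both directions.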
Reproductive toxicity testing has not been developed for, nor been largely applied to, chemicals in general, which is often overlooked, but has been used predominantly for pharmaceuticals and pesticides. Pharmaceuticals are designed for oral,
The expense and animal use associated with reproductive toxicity testing is questionable when considering that reproductive toxicity is most probably an event with a low frequency in the universe of industrial chemicals. An independent expert panel of industrial reproductive toxicologists has concluded that, in all likelihood, less than 5% of industrial chemicals possess properties that could be harmful to the developing child. We have found, by reviewing the New Chemical Database of the ECB, that 15 two-generation studies have led to only one R60 classification, whereas 58 one-generation studies have led to three classifications (Bremer et al., 2007a).
Publicly available data on reproductive toxicity are very rare. Less than 5% of dossiers in the US EPA HPV database or the EU New Chemical Database (not public) contain any data in this field (Bremer et al., 2007a). Knudsen et al. have analyzed available data (Knudsen et al., 2011) in various databases: "NIEHS' National Toxicology Program (NTP) online database, for example, provides developmental effects data on only about 3% of the listed chemicals (70 of 2,330). Other da- [...] Yang and Ann Richard, personal communication; see also Singh et al., 2010). The EPA Integrated Risk Information System (IRIS) contains comprehensive reviews for 553 environmental chemicals (as of April 2010), and identifies the most sensitive or 'critical effect' as the basis for setting safe exposure levels to protect the public health. The critical effect is the first observed effect deemed adverse that is likely to occur in the most sensitive species as the dose rate of an agent increases (IRIS, 2010). Less than 2% of 553 IRIS assessments report the critical effect for the derivation of a noncancer reference value (i.e., a safe exposure level) as being a developmental (5 of 553) or reproductive (4 of 553) effect (http://www.epa.gov/IRIS/). This may be due to other effects being more sensitive, but more likely due to a lack of developmental and/or reproductive effects data, which contributed to an increased uncertainty in the database for the choice of the critical effect, and resulted in a lower reference value in 85% of the cases where an uncertainty factor for an inadequate database was used. Finally, in one of the largest data compilations from multiple resources to-date, EPA's Aggregated Toxicology Resource (ACToR) identified available developmental toxicity data for less than 30% of the 9,912 chemicals in commerce or of environmental interest, out of a chemical domain of 418,513 generic chemicals (Judson et al., 2009)." It needs to be stressed that the described effects do not automatically point to impaired mammalian reproduction, but only to observed histopathological effects. The prevalence of reproductive toxicity is, most probably, lower than this query demonstrates.

Framework for replacing systemic toxicity by novel approaches
This framework is presented in more detail in Chapter 1. The following approaches to overcome animal testing for a given area were identified:
1. Abolition of useless tests
2. Reduction to key events
3. Negative exclusion by lack of key property
4. Optimization of existing tests
5. In silico approaches
6. Information-rich single tests
7. Integrated testing strategies (ITS)
8. Pathways of Toxicity (PoT) and Systems Toxicology
The distinction between (2) and (3) was made to stress that identifying positive or negative substances for a given hazard represents different approaches with different requirements as to prediction models, statistics, etc. Note that this framework remains largely on the level of hazard identification. Dose-response considerations and quantitative extrapolation to humans are not considered.

Abolition of useless tests
Every model has limitations; this holds true for in vivo (Hartung, 2008b), in vitro (Hartung, 2007a), and in silico (Hartung and Hoffmann, 2009) approaches. It is rare that a model contributes so little that it should be abandoned. In particular, it is impossible to predict whether a model cannot be improved to make a useful contribution in the future. The proposal by Balls and Combes (2005) to formally invalidate useless tests was the topic of a joint FRAME/ECVAM workshop (Balls et al., 2006). The participants finally agreed that invalidation makes sense only for prescribed regulatory tests, since the potential remains for further development and possible inclusion into the regulatory toolbox. Even though reproductive toxicity testing is not likely to be a candidate for abolition, it is worthwhile to apply criteria that typically are used for novel tests to illustrate the performance of the traditional tests.
The weaknesses of current developmental toxicity safety assessment were recently summarized as follows (Carney et al., 2011):
- Large numbers of animals required
- High cost per compound (>$100,000 per study)
- Long time requirements to evaluate each compound
- Capacity gap: cannot keep pace with increasing demands to evaluate existing and new chemicals, as well as mixtures
- Maternal toxicity: can confound data interpretation
- Fundamental knowledge of developmental biology for current animal models (e.g., rat, rabbit, monkey) is sparse relative to mouse or lower organisms
- Uncertainty regarding interpretation of low incidence findings
- Large amount of effort placed on the evaluation of minor skeletal variations with little impact on risk assessment
To overcome low sensitivity, regulatory bodies often request testing in a second species. It should be stressed that the sensitivity of the test design requesting two species is still unknown. But the consequence of requesting two species is dramatic: assuming a maximum prevalence of 5% for developmental toxicity in the universe of industrial chemicals, and requesting additional testing in another species in case of a negative first study, the number of animals needed for developmental toxicity testing is nearly doubled. Fortunately, in a 2009 amendment to REACH, the original consideration of a second species was removed, though the respective guidance for developmental screening by ECHA has not yet been adapted. In addition, a side effect of requesting a second species, often overlooked in current testing practice but with a high impact on large testing programs, is the increase in the rate of false positives, and therefore the unwanted restriction of valuable substances (Hartung and Rovida, 2009a; Hartung, 2009a).
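The "nearly doubled" estimate can be reproduced with a small Python sketch. It assumes that the second-species study uses the same number of animals as the first and that a second study is run after every negative first study. The 5% prevalence is the figure given above; the sensitivity and specificity values and the function name are hypothetical, and the estimate holds only if almost all non-toxicants test negative in the first species.

```python
def expected_animal_factor(prevalence, sensitivity, specificity):
    """Expected animal use relative to one study, if every negative first
    study triggers a second study of equal size in another species."""
    p_first_negative = (prevalence * (1.0 - sensitivity)
                        + (1.0 - prevalence) * specificity)
    return 1.0 + p_first_negative


# Idealized first-species test (assumed perfect sensitivity and specificity)
print(expected_animal_factor(0.05, 1.0, 1.0))

# Assumed 60% sensitivity with perfect specificity: missed toxicants add retests
print(expected_animal_factor(0.05, 0.6, 1.0))
```

With a prevalence of 5%, at least 95% of first studies are negative under these assumptions, so the expected animal use is close to twice that of single-species testing. A first-species test with lower specificity would produce fewer negatives and therefore a smaller factor.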
Many regulatory agencies have recognized the need for a transformative shift and have initiated research programs to achieve the vision and goals laid out by the NRC (Leist et al., 2008b;NRC, 2007). These include the NIEHS NTP Roadmap for the 21st Century from 2004 (National Toxicology Program, 2004) and the FDA Critical Path Initiative (Woodcock and Woosley, 2008;Woosley and Cossman, 2007)  Approaches for Testing and Assessment, and actively utilizes Test Guideline Committees and a QSAR Expert Group to ensure global harmonization and validation of any new approaches. What is most astonishing is the fact that we see more US and international activities than European contributions, though at this moment the highest demand for change is created by European legislations; efforts in the EU are mainly carried out by research consortia between academia and industry, with typically only long-term perspectives for transition into regulatory use.
- Use of high doses that sometimes far exceed human exposure levels
At the same time there is increasing doubt as to the usefulness of the 2nd generation for testing of substances. Janer et al. (2007) have shown in a retrospective analysis that this made no relevant contribution to the regulatory decision-making. US EPA obtained similar data (Martin et al., 2009a) supporting the development of an extended one-generation study (TG 443, OECD; OECD, 2011), originally proposed by the ACSA initiative. Though of lesser relevance here, this shows that (elements of) study protocols can indeed be useless and warrant critical assessment.
Another way of asking the question of relevance is whether the test is more sensitive (responsive at lower concentrations) for reproductive toxicity than the maternal toxicity, i.e., repeated-dose toxicity. For this comparison, Martin et al. (2009b) analyzed data in ToxRefDB for 254 chemicals tested in both multigeneration and 2-year chronic studies, and 207 chemicals tested in both multigeneration and 90-day subchronic studies: "For the majority of chemicals, potency values between the multigeneration, chronic, and subchronic studies were comparable, with a general linear relationship falling within ten-fold of each other. However, for four chemicals ... that caused parental or reproductive effects in the multigeneration study, there was no systemic toxicity observed in either the chronic or subchronic studies. For another five chemicals ... potencies for the most sensitive multigeneration endpoints were more than 10-fold greater than for the most sensitive effects in chronic studies. Of these five chemicals only thiamethoxam was more potent based solely on reproductive endpoints, that is, testicular atrophy." This means with an assessment factor of 10, the hazard of reproductive toxicity might be covered for 99.8% of substances.
The assessment here will be based on the most common criteria for validation.

Standardization of protocols
The protocol has recently been critically reviewed by Holson et al. (2006) and more recently by Carney et al. (2011), who conclude: "Developmental toxicity safety assessment is mainly a descriptive science designed to detect adverse developmental outcomes, namely teratogenicity, intrauterine death, intrauterine growth retardation, and functional deficits. Evaluation of teratogenicity requires detailed examinations of fetal morphology, including external features, internal organs and tissues, and assessment of more than 200 bones of the fetal skeleton. These assessments have evolved over time, such that very subtle changes (often called variations) can be detected, in addition to (real malformations).
The descriptive nature of these fetal examinations brings with it some critical challenges … One is that the evaluation criteria and nomenclature for fetal morphology has been difficult to standardize across different laboratories. Although this problem would seem to be easily remedied, it has been difficult because individual laboratories have built up large volumes of historical data based on their own criteria, and they also may use different animal strains and evaluate fetuses on different days of gestation. Fetal examinations also are very time consuming and labor intensive, and require a significant investment in examiner training in fetal morphology, coupled with extensive proficiency testing.
One issue with skeletal evaluation is the interpretation of minor skeletal variations and their impact on risk assessment. This issue was the subject of a previous ILSI-HESI expert panel project … (Daston and Seed, 2007). Depending on the laboratory's evaluation scheme, a large number of individual skeletal variations often are recorded and some occur at a very high incidence (sometimes >80%), even in control animals. Many laboratories distinguish between several subtly different degrees of ossification of individual bones, leading to a large volume of statistical analyses and evaluation of corresponding historical control data (reviewed in Carney and Kimmel, 2007). Although the practice of recording minor skeletal variations was established many years ago, we have since learned that the skeletal system possesses an extensive capacity to remodel during postnatal development, and current evidence indicates that many of the minor skeletal variations present in the term fetus are no longer evident postnatally. ... Thus, minor skeletal variations, particularly findings such as wavy ribs and minor delays in ossification are generally not considered adverse in and of themselves (Carney and Kimmel, 2007). ... The interpretation of fetal malformations can also be a challenge, particularly when faced with a low incidence of a particular malformation occurring in the high-dose group only. As highlighted by Palmer many years ago, 'because low rates of malformation are the rule, one faces the recurring nightmare of deciding whether one or two malformations are related to treatment or accidental' ... Currently there are few options for resolving these issues, which is of particular concern given the enormous impact on regulation of the chemical as well as the potential labeling of the compound as a teratogen. In some cases, the studies have been repeated using extremely large sample sizes, but this is obviously problematic in terms of animal use, costs, and time. Mechanistic studies are another option, although these may only be possible if higher doses can be used to increase the incidence. … statistics often are of limited help in resolving these uncertainties, as very large numbers of offspring are needed to achieve the statistical power needed to detect an increase in low incidence malformations. To overcome some of these statistical limitations, historical control data are considered in judging whether or not a low incidence finding seen in a treated group might have been a chance occurrence. However, historical control data should be used judiciously and within a reasonable time frame, as drift in the background incidence can occur over time, as can sudden spikes in the incidence of a particular effect."
The very extensive analysis by Holson et al. is based on experience with about 1,500 studies (Holson et al., 2006). It also is based on a 1984 analysis carried out by the National Center for Toxicological Research on behalf of FDA entitled Reliability of Experimental Studies for Predicting Hazards to Human Development, which was never published in the open literature. They show the background of "abnormal" reproductive outcome, for example the spontaneous resorption of small litter: 43% of rabbits with a single implant resorbed it and 10% terminated pregnancy prematurely via abortion. 3% and 5% abnormal outcomes were found for 2 and 3 implants, respectively. The authors also suggest: "The slope of the dose-response curve (is) ... often steeper in developmental toxicity studies than in other toxicity studies," which means that effects occur only close to maximum tolerated doses, which "grossly overpredict risks." Another problem they identify is the high background of spontaneous adverse developmental outcomes (Tab. 6.2). The most commonly occurring manifestations of these findings are: (1) right carotid and right subclavian arteries arising independently from the aortic arc (no brachiocephalic trunk),

Reproducibility
These screening protocols have been employed mainly in national and international programs to gather screening-level data for chemicals. However, this study design has limited sensitivity and produces a high level of equivocal results that often have to be further evaluated in more "definite studies," such as a prenatal developmental toxicity study and/or a two-generation study. Given that the screening requires 560 ani-
(rat, rabbit, or mouse) was capable of detecting more than 61% of the teratogens. However, this study should be interpreted with caution since Schardein (2000) has provided an extensive study in which several hundreds of chemicals have been assessed for their interspecies variations.
Bailey (2005) examined the data for 11 groups of known human teratogens across 12 animal species and found huge variability in positive predictability, with a mean of 61% (Bailey et al., 2005): "Of the 139 individual classifications across the species tested, a total of 78 (56%) were positive; the remaining 44% of results were almost entirely negative. The only encouraging aspect to come from these statistics appears to be the high positive predictability score for the hamster; however, the USFDA published a report detailing the responses of the mice, rats, rabbits, hamsters, and monkeys to 38 known human teratogens in which the high scoring hamster produced only a 45% rate of correct positives (USA FDA Federal Register 'Caffeine,' 1980). Furthermore, the mean percentage of correct positives from any one of these species was only 60%... The US FDA report also analyzed the rate of concordance between these species and humans for 165 compounds known to be non-teratogenic in the latter; the 'order of merit' for each species and its negative predictive value were completely different from that for the positive predictive values, ranging from 80% in monkeys to 35% in mice and hamsters. The mean negative predictive value for any of these species was 54%. Taken together, these predictive values of 60% and 54% for human teratogens and human non-teratogens, respectively, represent a poor return on the investment of animals, time, labor and money. The 57% mean value is little better than the 50% that would have been obtained by pure chance." The "precautionary" response of regulatory toxicology was to test in more than one laboratory animal species in order to reduce the 40% missed potential developmental toxicants. However, this inevitably increases the already 40% false-positive classifications (Hartung, 2009a). 
Whether we can afford this substantial over-labeling, especially in high-production volume chemical evaluation programs, has been discussed elsewhere (Hartung and Rovida, 2009a).
Discordance in developmental toxicity testing certainly seems to conflict with the widely held dogma stating that the basic events in embryo development are highly conserved across species, even for species as disparate as fruit flies, frogs, mice, and humans. This degree of conservation mainly applies to the most fundamental processes in embryogenesis, such as establishment of the general body plan, pattern formation, cellular induction, and regulation of differentiation via signaling pathways. On the other hand, pharmacokinetics and, in particular, maternal metabolism, can vary widely between species and are likely to drive interspecies discordance. Placental anatomy and physiology also vary greatly between conventional test species and humans. In fact, rats, mice, and rabbits utilize two very different types of placentae -the inverted visceral yolk sac placenta which is extremely important in early pregnancy, as well as a chorioallantoic placenta which does not become functional until mid-pregnancy. In contrast, humans only utilize a chorioallantoic type of placenta throughout most of gestation (Georgiades et al., 2002). Holson et al. (2006) list the following limitations for reproductive toxicity assessments for the most common species: mals per test, the application of this test in its present form as a screening tool should be reconsidered for large toxicological programs. The reasons for equivocal results can be several: One is that the data are simply inconclusive; another is that this is related to either variability or lack of reproducibility. Thus it is either reproducibility or robustness of the test that has an impact on reproducibility. An improvement of the test design to increase accuracy of the test by reducing the number of equivocal results is desirable. 
Notably, the "definitive" multi-generation studies also have a high rate of equivocal results: "The number of equivocal results remained high across these six species at just under 25%" (Bailey et al., 2005). Hotchkiss et al. (2008) addressed the inherent variability of the litter-based endpoints. Power calculations were performed for categorical effects based upon the numbers of malformed males versus males without malformations per dose group: "If 20 animals per dose group are examined for malformations, then lesions occurring at an incidence of 25% or greater can be detected, whereas an incidence of 10% can be detected if all the pups are examined from 20 litters. If only ten males per group are examined, as recommended for histopathological analyses in some regulatory agency test guidelines, then effects are only detected statistically if about 50% or more of the tissues/organs are affected; a level of statistical power that many would consider inadequate."
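The orders of magnitude in the quoted statement can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not the method used by Hotchkiss et al. (2008), which is not described here: it assumes a two-sided Fisher exact test at the 5% level, equal group sizes, and no affected animals in the control group, and it asks for the smallest observed incidence in the treated group that would reach significance. The function names are hypothetical.

```python
from math import comb


def fisher_two_sided(a, n1, c, n2):
    """Two-sided Fisher exact p-value for a/n1 affected (treated) vs c/n2 affected (control)."""
    k = a + c  # total number of affected animals (fixed margin)
    total = comb(n1 + n2, k)

    def prob(x):
        # Hypergeometric probability of x affected animals in the treated group
        return comb(n1, x) * comb(n2, k - x) / total

    p_obs = prob(a)
    lo, hi = max(0, k - n2), min(k, n1)
    # Sum all tables that are at most as probable as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def min_detectable_incidence(n, alpha=0.05):
    """Smallest observed incidence among n treated animals that is significant vs 0 of n controls."""
    for affected in range(1, n + 1):
        if fisher_two_sided(affected, n, 0, n) < alpha:
            return affected / n
    return None


if __name__ == "__main__":
    for n in (20, 10):
        print(f"{n} animals per group: minimal significant incidence = {min_detectable_incidence(n):.0%}")
```

Under these assumptions, 5 of 20 affected animals (25%) and 5 of 10 affected animals (50%) are the smallest significant observations, which matches the thresholds in the quotation. Any background incidence in the controls would push these thresholds higher.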

Scientific Relevance
The relevance of studies raises a concern: "However, if dosing was high enough to cause the above described 'maternal toxicity,' these doses often also cause some effects in offspring. So the crux is that, on one hand the experimenter must apply high doses in order to fulfill the guideline requirements, while on the other hand results achieved at such doses may lead to the classification of a compound." Holson et al. (2006) observe the problem of statistics applied without correction for the multiple endpoints assessed: "Because, for example, a standard developmental toxicity study with ANOVA/Dunnett's and Kruskal-Wallis/Mann-Whitney statistical analyses performed on all parametric and nonparametric data, respectively, may involve as many as 100 to 300 individual hypothesis tests, the possibility exists for numerous spurious statistical findings." Another biasing effect is the "litter effect," i.e., the common observation that several fetuses of the same litter are affected, thereby "... artificially inflating the apparent group response" and leading to false-positive results.
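The scale of the multiple-testing problem raised by Holson et al. (2006) can be made concrete with a short calculation. This is a simplified sketch that assumes all hypothesis tests are independent, that no true effect exists, and that each test is run at a significance level of 0.05; the function names are hypothetical, and the Bonferroni adjustment is shown only as one common correction, not as a recommendation from the cited authors.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of spurious significant findings when no true effect exists."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one spurious finding among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (100, 300):
        print(
            f"{n} tests: expected false positives = {expected_false_positives(n):.0f}, "
            f"P(at least one) = {familywise_error_rate(n):.3f}, "
            f"Bonferroni per-test alpha = {bonferroni_alpha(n):.5f}"
        )
```

With 100 to 300 uncorrected tests, about 5 to 15 spurious findings are expected per study, and at least one is practically certain. Real endpoints in a developmental toxicity study are correlated, so the independence assumption is only an approximation.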

Predictivity of point of reference (human reproductive toxicity)
The ability of animal models to predict the human response is a fundamental assumption in developmental toxicity and risk assessment, yet varying degrees of discordance among species are very common in actual practice. Pronounced interspecies variances have been described showing not more than 60% correlation between different laboratory mammalian species in the area of developmental toxicity. There is no reason to assume that any species predicts humans better than, e.g., mice predict rat developmental toxicity of a given chemical. Hurtt et al. (2003) have demonstrated by analyzing 91 veterinary drugs that no single species …
… especially at normal exposures and therapeutic dose levels. Notable examples include glucocorticoids, benzodiazepines, caffeine, carbon dioxide, dopamine, indomethacin, and aspirin (Bailey et al., 2005; Hartung, 2009c). A simple calculation shows that, for a prevalence of 2.5% human reproductive toxicants among industrial chemicals, testing in two species (correlating with each other and with humans at 60%) will result in 65% of all substances being labeled false-positive, while the real positives identified amount to 2.1% of all substances (85% of the toxicants) (Hartung and Rovida, 2009a; Hartung, 2009a).
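The arithmetic behind such an estimate can be sketched as follows. This is an illustration under explicit assumptions, not a reproduction of the calculation in the cited papers: the 60% correlation is interpreted here as 60% sensitivity and 60% specificity of each species toward humans, the two species are treated as statistically independent, and a substance is labeled positive if either species is positive. The function name and dictionary keys are hypothetical.

```python
def two_species_outcome(prevalence, sensitivity, specificity, n_species=2):
    """Fractions of all tested substances ending up as true or false positives
    when a positive result in any of n_species independent tests triggers labeling."""
    missed_by_all = (1 - sensitivity) ** n_species  # toxicant negative in every species
    negative_in_all = specificity ** n_species      # non-toxicant negative in every species
    true_positives = prevalence * (1 - missed_by_all)
    false_positives = (1 - prevalence) * (1 - negative_in_all)
    return {
        "true_positives": true_positives,
        "share_of_toxicants_found": 1 - missed_by_all,
        "false_positives": false_positives,
    }


if __name__ == "__main__":
    result = two_species_outcome(prevalence=0.025, sensitivity=0.60, specificity=0.60)
    for name, value in result.items():
        print(f"{name}: {value:.1%}")
```

Under these assumptions the sketch gives 2.1% true positives (84% of the toxicants) and about 62% false positives among all substances, close to the rounded figures given in the text. The small difference presumably reflects rounding or slightly different assumptions in the cited calculation.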
Brown and Fabro (1983) estimated that "Of those agents thought not to be teratogenic in man, only 28% are negative in all species tested" (Tab. 6.3).
They also did not find a strong concordance of potency (Tab. 6.4).
"Rat - susceptible to dopamine agonists (dependence on prolactin for maintenance of early pregnancy), prone to premature reproductive senescence following treatment with GABAnergic and other CNS-active agents, increased susceptibility to Leydig cell tumors, increased susceptibility to mammary tumors, inverted yolk sac placentation, limited fetal period.
Rabbit - Consume diet inconsistently, prone to abortion and toxemia, induced ovulatory, sensitive to local gastrointestinal disturbances (e.g., antibiotics), not routinely used in repeated-dose toxicity studies, prone to resorption when few implantations are present, inverted yolk sac placentation."

Specificity
There are many examples of positive results in the routine species that have little or no effect in humans ("false-positives"),

Tab. 6.3: Concordance of human and animal teratogenicity data
(modified from Brown and Fabro, 1983; from US FDA)

Human teratogens ψ               Human non-teratogens -
Two or more species    80%       Two or more species    50%
Any one species        97%       All species            28%

ψ 38 compounds: "reports of birth defects in humans associated with intake."
- 165 compounds: "for which human teratologic effects have not been reported."
ǂ From the published information, the exact meaning of an 85% response rate is not clear. It could mean, for example, 85% of the agents were positive in at least one mouse study, or of all tests of these agents in the mouse, 85% were positive.

Tab. 6.4: Comparison of teratogenic potency of chemicals in humans and animals
(modified from Brown and Fabro, 1983) Lowest effective dose (mg/kg/day)

… with fetal malformations, but is not teratogenic in the rat (Bailey et al., 2005). However, the identification of known human teratogens also was not necessarily in the routine species, and we have to keep in mind that this was often retrospective analysis, where the effect in humans was known and was looked for in the animal studies, creating considerable bias.

Applicability domain
An applicability domain, i.e., the part of the chemical universe where the animal tests give correct predictions, has not been established for the different animal studies.
Taken together, no comprehensive critical evaluation of current in vivo testing is available, with the exceptions of a book chapter by Holson et al. (2006) and two narrative reviews (Bailey et al., 2005; Carney et al., 2011). There is some concern, which warrants a systematic review. Evidence-based toxicology offers a toolbox for such evaluations. Hartung and Hoffmann (Hartung, 2010c; Hoffmann and Hartung, 2006) conclude: "Thus, a crucial need remains for an organized and critical analysis of the primary literature in reproductive toxicology to evaluate the concordance of regulatory reproductive toxicity studies to human exposure outcomes." A critical problem is that reference data from humans are difficult to obtain from epidemiology (Friedman, 2009). Our knowledge of human teratogens is very much limited to drugs. Furthermore, we are lacking a process such as IARC for carcinogenicity in the field to achieve consensus on the reproductive toxicity of substances. Validation of novel tests against the traditional animal models should be done with …
Similarly, Bailey and Knight (2005) summarized their collected data (Bailey et al., 2005): "This means that of 1223 definite, probable, and possible animal teratogens, fewer than 2.3% were linked to human birth defects." The consequence of low specificity in order to boost sensitivity, which can be seen as "precautionary," creates concerns as to the societal costs (Durodie, 2003). A breakdown of embryotoxic effects of 74 industrial chemicals, which have been tested according to EU Directive 67/548/EEC B31 in the New Chemical Database, showed that 34 chemicals have demonstrated effects on the offspring, but only two chemicals have been classified as developmentally toxic according to the standards applied by the national competent authorities. This demonstrates the lack of confidence in the specificity of this "definitive" test.

Sensitivity
The same analysis by Bremer and Hartung (2004) showed that 55% of these chemical effects to the offspring could not be detected within multi-generation studies (Fig. 6.3), which suggests that either the developmental toxicity screening tests are over-predictive or that the multi-generation assays lack sensitivity. This is in contrast to claims that "... Every chemical or drug known to be teratogenic in humans, with possibly two exceptions, is also teratogenic in one or more laboratory species" (Schardein, 2000). One such exception is the prostaglandin E1 analogue misoprostol: Treatment of humans with this drug for peptic ulcer disease or to initiate labor has a strong association …

Fig. 6.3 (… Bremer and Hartung, 2004): The figure shows that not all embryotoxic effects will be picked up in one/two generation studies. Additional developmental toxicity tests are necessary. Further investigation is necessary to understand whether one/two generation studies can be combined with a set of in vitro methods for developmental toxicity in a conceptual framework that could perform reliable hazard identification of a chemical.

… seeks targeted testing that provides sufficient toxicological data for hazard identification, but also keeps in vivo testing to a minimum.
A frequent scenario for an alert-driven strategy could be unclear histopathological observations in the testes in subacute or chronic toxicity studies. These findings should not automatically trigger additional animal-intensive tests for reproductive toxicity. These effects should be further explored, however, by using in vitro testing batteries that analyze cytotoxic effects on specific cell populations of the reproductive organs and/or by analyzing relevant hormone production or by monitoring gametogenesis in vitro. The obtained data will show whether the observed changes in the tissues of reproductive organs are reprotoxic effects or are related to general toxicity. The establishment of relevant databases (Judson, 2010) such as the Fraunhofer Society's database (Bitsch et al., 2006) or ToxCast's ToxRefDB (Knudsen et al., 2009; Martin et al., 2009a,b) will support the development of such a scientific approach. A query of the former database (Bremer et al., 2007a), containing 329 chemicals tested in repeated-dose studies in rats and 203 chemicals tested in mice, has demonstrated that the major targets of chemicals showing toxicological effects on the testes are cells that can also be cultured in vitro. However, substantial research efforts are still necessary to maintain the functionality of target cells in vitro and to convert these in vitro models into predictive tests using specific functions as toxicological endpoints. Changes in the functionality of certain target cells will point to the relevant target mechanisms and will help to interpret whether the observed effects are relevant to humans.

Negative exclusion by lack of key property
Instead of positively identifying a key property, which would lead to classification, it is often more attractive to exclude a key property to come to no classification. This is especially the case when hazards are relatively rare and positive identification will not save many test efforts. Properties that especially come to mind are the barrier models suggesting limited bioavailability. This can be oral availability on the side of the mother or the placental barrier. Models are available for both the placental barrier (Bremer et al., 2007a; Mose and Knudsen, 2006; Myren et al., 2007; Poulsen et al., 2009) and oral uptake (see Chapter 2), but a key problem is whether they are sufficiently predictive to completely rule out a possible effect, especially as the placental barrier changes its properties over time during pregnancy. Bremer …
… (Balls and Combes, 2005) is not likely to find major support in the absence of valid alternative approaches, given the importance of this subject. Nevertheless, it might be worthwhile to examine the in vivo reproductive toxicity tests applying the principles of evidence-based toxicology (EBT) (Hartung, 2009b). The development of in vitro methods might be furthered by evidence that the current test system is not providing the safety information we are looking for.

Reduction to key events
Most reproductive toxicity testing is done to exclude teratogenic effects. For this reason, an early focus in alternative method development was on tests for embryonic malformations (Augustine-Rauch et al., 2010). The most complete reflection of embryonic development apparently can be achieved with zebrafish embryos (Selderslaghs et al., 2011; Sukardi et al., 2011; Weigt et al., 2010, 2011; Yang et al., 2009), for example using dynamic cell imaging, or frog eggs (FETAX assay) (Hoke and Ankley, 2005), which has been evaluated more critically by ICCVAM 1 . It seems to be timely to evaluate available protocols and datasets and define a protocol for formal validation.
By 2002, three well-established tests had already been validated, i.e., the mouse embryonic stem cell test, the whole rat embryo culture, and the limb bud assay (Genschow et al., 2002; Piersma et al., 2004; Spielmann et al., 2004). They obviously cover only a small part of the reproductive cycle and only a small though critical part of embryonic development. Among them, the murine embryonic stem cell test (EST) has attracted most interest. Originally based on counting the beating heart cells formed, it has now been adapted to other endpoints and to human cells (Leist et al., 2008a). At present, the EST has its application primarily in in-house hazard identification. To reach regulatory implementation, further characterization is needed, such as definition of the biological and chemical applicability domain, mechanistic studies to identify developmental pathways, comprehensive comparison of the developmental processes active in EST differentiation with in vivo embryogenesis, and ultimately predictability of the EST for the developmental phase covered.
The entire reproductive cycle with its vulnerabilities most probably cannot be broken down to one or a few key events. For practical purposes, however, we might test for these, especially when certain alerts lead to these test needs, typically from findings in repeated-dose testing. Why study the entire reproductive cycle when an alert already hints at a certain problem? If these data are insufficient for regulatory decisions, but alerts have been identified, the existing data can be used as the basis for the development of a tailored testing scheme. Depending on the nature of the alerts, test batteries of specific validated in vitro tests could be triggered in order to confirm or refute observed concerns. For example, a histopathological finding in the testes observed in repeated-dose studies would be followed up by tests in spermatotoxicity models, not by a two-generation study, if the classification cannot be done based on the finding alone. This approach, which we termed "alert-driven testing" (Bremer et al., 2007a), …
… hindering the access of a chemical to it. ... The toxicant concentration reaching the embryo is a critical factor in developmental toxicity. Among the mechanisms regulating the disposition of toxicants from the maternal circulation to the embryo, drug efflux transporters play a key role, and are possibly responsible for interspecies variability." This argues very much for the use of human placentas, which are relatively easily available.
A very promising approach is to use thresholds of toxicological concern (TTC) (Kroes et al., 2005) to define exposure limits below which an effect is sufficiently improbable, as reproductive toxicity is considered a threshold effect. Van Ravenzwaay and coworkers (2011) determined such a TTC for reproductive toxicity at 8 μg/kg bw/d. These approaches can be further refined either by distinguishing classes of chemicals or by using an internal TTC, i.e., basing the threshold on plasma concentrations actually achieved. An interesting option would be to use experimental barrier model data to modify the TTC level.

Optimization of existing tests
Reproductive toxicity is by its very nature characterized by "complexity layered on complexity," and the devil might be found in the details. Reproducibility, robustness, and reliability combined with a relevant, sound scientific base will be critical for an acceptable test going forward.
The three embryotoxicity tests validated in 2002 have received considerable interest for further optimization. In order to review and discuss the next steps of using the tests, an ECVAM workshop was held in January 2003 (Spielmann et al., 2006). A panel of 12 European and American experts from industry, academia, and governmental institutions analyzed the tests for chemical and pharmaceutical safety testing in vitro. The outcome of the workshop can be summarized as follows (Spielmann et al., 2006):
1. The tests are reliable and transferable to other laboratories.
2. The prediction models need to be revised in order to achieve a better discrimination between non- and weak/moderate embryotoxic chemicals.
3. The tests should also be applied to industrial chemicals to demonstrate the reliability and relevance of the system, since within the formal validation study primarily pharmaceuticals have been tested.
4. The selected strong developmental toxicants represent a limited number of mechanisms of toxicity, mostly affecting cell proliferation. Strong embryotoxic chemicals with other toxicological mechanisms should be tested in order to enhance the reliability for a wider applicability of the tests for a broader range of chemicals.
5. A metabolic system to detect proteratogenic compounds has to be integrated in order to extend the applicability.
6. Other differentiation pathways have to be included in the tests. Additional major target tissues such as the nervous system and the skeletal system have to be included in order to get precise information about the teratogenic potential of chemicals.
A lot of work has been done to further optimize the standard murine EST, which was shown to distinguish the different embryotoxic potentials of even very closely related structures (de Jong et al., 2011). Optimizations of protocols were reported (De Smedt et al., 2008; Seiler et al., 2004; Seiler and Spielmann, 2011). In order to move towards transcriptomics read-outs, PCR was employed to monitor specific gene expression (Pellizzer et al., 2004). Knudsen and colleagues (Knudsen et al., 2011) demonstrated the mEST's ability to capture data on disruption of developmental signaling pathways as a potential alternative for assessing developmental toxicity. "His example focused on the expression of genes for the 17 + 2 conserved signaling pathways critical to early development (National Research Council, 2000), taking the hypothesis that an abnormal activation or inhibition of signaling pathways can lead to developmental toxicity. The test system uses murine ESCs cultured 3 days as hanging drops that form 'embryoid bodies' with gene expression patterns for ectodermal, mesodermal, and endodermal lineages. Analysis of gene expression at 5 days revealed the top expressed signaling pathways as Cadherin, Wnt/β-catenin, Hedgehog, Integrin, ND, Nuclear Hormone, and Receptor Ser/Thr kinase."
The EC report (Adler et al., 2011) (cited literature there) gives a comprehensive overview on variants of the embryonic stem cell tests with respect "... to their readouts but also in the target cell differentiation (Peters et al., 2008; Zur Nieden et al., 2004). Depending on the area of application, effects on differentiating neural cells (Stummann et al., 2009b; Theunissen et al., 2010), cardiomyocytes (Buesen et al., 2009) and skeletal cells (Stummann et al., 2009b; Zur Nieden et al., 2004; Zur Nieden et al., 2010) have been investigated. Effects on the quantity of differentiated target cells have been assessed by using immunological methods such as flow cytometry (Buesen et al., 2009) or molecular biological methods such as RT-PCRs and omics (Chapin et al., 2007; Osman et al., 2010; van Dartel et al., 2009; van Dartel et al., 2010; West et al., 2010; Winkler et al., 2009; Zur Nieden et al., 2001; Zur Nieden et al., 2004). Several of the methodologies could also be automated in order to increase the throughput of substances and make the test available for screening purposes (Peters et al., 2008)." A key development certainly is to translate the EST to human stem cells (Pal et al., 2011; Pellizzer et al., 2005; West et al., 2010). This promises, finally, to overcome species differences for the key health concern of reproductive toxicity.
A key limitation of many in vitro tests is the lack of metabolizing capacity (Coecke et al., 2006). Efforts to combine the EST with metabolizing systems have been described (Bremer et al., 2002; Hettwer et al., 2010). Another improvement was the combination with kinetic modeling (Verwei et al., 2006). Similar optimization work was also carried out for the whole embryo culture (Piersma et al., 2008), adding metabolizing systems (Luijten et al., 2008). The added value and validity of these variants should be assessed systematically. Notably, the modular approach would allow assessing only the aspects that have been changed, and, by establishing performance standards for the murine EST, validity could possibly be established with reasonable effort.
Other promising approaches include the use of, or combination with, computation models of development pathways and systems and, finally, high-throughput in vitro approaches as, …
… developmental toxicity data is needed.
Such a compilation would allow investigators to methodically examine a range of considerations when selecting and utilizing toxicological data in training sets (i.e., various experimental factors, various approaches to combining/separating categories of endpoints, and alternative scoring systems); 2) Training sets for discrete developmental endpoints should be developed. This would allow examination of the process used to assemble training sets, as well as the effect of alternative processes on the predictive performance of the model. The Working Group considers these first two recommended efforts, compiling/analyzing a comprehensive database and developing/investigating alternative training sets, as complementary and iterative exercises; 3) The combined use of multiple types of tools and approaches for screening should be investigated.
In conclusion, the Working Group recognizes there is a need for valid and efficient methods to screen large numbers of environmental contaminants for their potential to pose a developmental hazard. Whereas the use of SAR models for exploratory studies is encouraged, statistically based SAR models, in their current form, are not yet sufficiently developed or validated to yield confident predictions with which to identify potential developmental toxicants in a screening program. The Working Group believes that the efforts recommended in this report will contribute to improving the potential of statistically based models for this application." Similarly, Cronin et al. (2002) summarize: "There are a number of problems with applying QSARs to reproductive toxicology notably the complexity, subtlety, and sometimes ill-defined nature of the endpoints and lack of data available for modeling." Hewitt et al. (2010) conclude similarly: "This study demonstrates the limited success of current modeling methods when used in isolation. However, the study also indicates that when used in combination, in a weight-of-evidence approach, better use may be made of the limited toxicity data available and predictivity improved." Recommendations (condensed here) are provided as to how this area could be further developed in the future:
- Availability of suitable toxicity data; almost exclusively data collected for pharmaceutical compounds, which may prevent the study of predictions made for industrial chemicals.
- Placental transfer can be useful as a modulating factor.
- Existing "global" (Q)SAR models for reproductive and developmental toxicity must be treated with caution. Given the plethora of different mechanisms (many of which are unknown) involved within reproductive and developmental toxicity, a single "catch all" (Q)SAR model is likely to show limited performance. If literature data are available, a number of structurally/mechanistically restricted "local" (Q)SARs would be more appropriate.
- At present, category formation approaches are promising but they are limited, both by available data from which to select category members and by the approaches available to define categories.
- Currently, the structural alert approach, as used in DEREKfW, requires more alerts to be developed for reproductive and developmental toxicity.
- The importance of time-dependent effects should also be considered.
for example, those being utilized by the EPA ToxCast program (Sipes et al., 2011a) (see below).

In silico approaches
The development of reliable QSARs for reproductive toxicity is currently suffering due to a lack of high quality in vivo data and the complexity of the reproductive toxicity endpoint, which involves several known and unknown toxicological mechanisms.
It should be stressed that QSARs can be based on either in vivo or in vitro data. The uncertainty of the origin of data should be taken into account when integrating these models into testing strategies. Some commercially available toxicity prediction software packages claim to detect reproductive toxicants. Maslankiewicz et al. (Bremer et al., 2007a) have reported that when the software program DEREKfW was challenged with around 100 reproductive toxicants included in Annex I of Directive 67/548/EEC, 90% of the chemicals classified for "impaired fertility" and 81% of the chemicals that cause harm to the unborn child were not detected. The TSCA chemical category list of the new chemical program of the US EPA failed to detect 77% of the EU-classified chemicals causing adverse effects to mammalian fertility, and 82% of the developmental toxicants were not correctly identified. This is in strong contrast to mere internal validations that show results of >80% correlation for reproductive toxicity (Matthews et al., 2007), illustrating the importance of objective assessments.
A working group of ILSI/HESI assessed structure/activity relationships (SAR) (Julien et al., 2004) and summarized: "The Working Group's investigation of two statistically based SAR systems that have been applied to developmental toxicity elucidated the difficulties in predictive modeling of this toxicity. With a statistically based approach, the activity (or inactivity) of each training set compound must be captured in a way that can be correlated with the presence or absence of chemical structural features. This poses a number of methodological challenges. The particular 'activity' representing developmental toxicity must be defined. Also, an objective, rational, reproducible, and transparent process for scoring a training set compound for the activity must be developed. Additional methodological challenges derive from the dynamic nature of development and the general sparseness of published developmental toxicity data.
To advance the potential of SAR for predictive modeling of developmental toxicity, it will be necessary to develop general scientific agreement on valid and transparent methodology for selecting, categorizing, and scoring developmental toxicity data. Such methodology should be developed by an interdisciplinary panel of developmental toxicologists and developmental biologists, working in consultation with SAR model developers and individuals with other relevant expertise (e.g., biostatisticians). The recommendations from this panel should undergo peer review.
The Working Group recommended three research efforts that will inform the development of improved methodology: 1) A systematic and holistic analysis of developmental toxicity data of adequate quality and quantity should be conducted. Toward this aim, a comprehensive, publicly available electronic database of …
This highlights the need for "virtual models" in which a toolbox of dynamic models can be used to interpret HTS data and pathway-based information. Latest studies from ToxCast have demonstrated the feasibility of predictive modeling of fertility, blood vessel development, and prenatal developmental toxicity (Sipes et al., 2011a). Angiogenesis can be considered an example, as cell-agent based models (ABMs) for angiogenesis have been developed that recapitulate HTS data at a histological scale (Kleinstreuer et al., 2011). In this regard, (Q)SARs may become more informed as we train these read-across methods with information from HTS data and cellular ABMs. Another approach emerging from the above-mentioned data is called "Towards a virtual embryo." The final goal is to apply HTS data, in silico tools, and models to look globally at developmental processes and toxicities in a new way. Predictive and mechanistic models would dynamically integrate data with relevant information about embryonic systems. Applying "Virtuomics" and running "what-if" scenarios to predict adverse outcomes from different perturbations might allow scientifically-based predictions of how development might be affected across a range of complex factors. A toolbox of virtual tissue models may someday comprise a modular virtual embryo for simulating important information as part of an integrated testing strategy 2 .

Information-rich single tests
Complex phenomena such as elements of the reproductive cycle and their perturbations usually can be captured better by multiple endpoints than by a single biomarker. Functional endpoints such as formed beating heart cells in the EST already integrate many biological pathways, but new technologies allow assessing a multitude of measurements using high-content technologies. Both omics and image analysis can add new qualities to interpretation of the biological models (Hartung and Leist, 2008). However, it is important to keep in mind that whatever fancy analysis we add, it can hardly overcome the limitations of the underlying model (Hartung, 2010b, 2011). So the same considerations of the limits of both animal and cell models apply. We also need to keep in mind that the novel technologies pose an enormous challenge to the validation process as exemplified for toxicogenomics approaches (Corvi et al., 2006). The number of parameters to control and document, the sometimes high cost per single measurement limiting replicates and numbers of substances tested, the complex prediction models for information-rich methods, and the rapid turnover of technological change are only a few examples of the challenges faced.
In vitro work so far has combined mainly whole embryo culture and transcriptomics (Luijten et al., 2010) or the EST with metabolomics (Kleinstreuer et al., 2011;West et al., 2010), proteomics (Groebe et al., 2010;Klemm et al., 2008;Klemm and Schrattenholz, 2004;Seiler and Spielmann, 2011) or transcriptomics, as summarized recently (van Dartel and Piersma, 2011). These approaches use patterns or biomarkers derived from a training set of substances to identify substances with similar mode of action. Their predictive value looks promising but awaits formal validation.
- A weight-of-evidence prediction is dependent upon whether a valid chemical category can be formed for read-across, (Q)SAR models, and chemical profilers for specific reproductive toxicity effects.
- A need for collaboration between scientists with experience in computational modeling and those with experience in interpretation of developmental toxicity data has been highlighted.
- There is value in considering more than one in silico approach within a weight-of-evidence framework.
There is clearly a need for access to existing animal and human data to improve the situation. New technologies and bioinformatic methods can only be utilized if there is increased sharing of data. A number of research efforts already allow global access to information such as ACTOR, ToxRefDB, the ILSI-HESI toxicogenomics project, etc. This concept of data sharing has also been incorporated into REACH, which requires that toxicological data be made publicly available, but the summarizing data typically do not qualify for modeling approaches. The need to make data available extends to publicly recorded human clinical trials and pregnancy registries.
Complementary to this issue of globally available data is the need for consistent and universally accepted terminology for characterizing effects. Historically, the developmental toxicology community has embraced this concept, with international collaborative projects and publications on terminology used in the evaluation of fetal specimens (e.g., Makris et al., 2009;Wise et al., 1997). This same attention to consistency and precision in terminology must also be applied to new technologies for developmental toxicity testing.
Altogether, it is unlikely that in silico approaches as standalone methodologies will make a major contribution to reproductive toxicology in the near future. This is in line with some growing skepticism on (Q)SAR as stand-alone methods in regulatory safety assessments in general (Doweyko, 2004;Hartung, 2009b;Hawkins, 2004;Raunio, 2011).
In contrast to the above-mentioned drawbacks of (Q)SARs, computational toxicology based on High-Throughput Screening (HTS) data, and cell agent-based models (ABMs) have been able to simulate prototype toxicity pathways that affect growth, morphogenesis, and development. In vitro profiling manages to screen for targets, pathways, and processes to build predictive signatures for discrete adverse outcomes from animal data or human epidemiology where available. Functional assays must extend these signatures to mechanistic relationships and pathway-based inferences for an integrated testing strategy. As we increase biological knowledge, it will be necessary to build and utilize biologically informed models that can simulate downstream consequences of perturbation. In this regard, computational systems biology is needed to reconstruct higher-order biological effects from the more fundamental in vitro data. These predictive models demonstrated the feasibility of predicting ToxRefDB animal toxicity solely from the results of HTS data. In the future, it will be necessary to perform forward validation of these models without dependence on animal data (for compounds lacking such data).
The idea of a comprehensive ITS would be to provide for as many substances as possible enough information to avoid the ultimate animal test. Ideally, all aspects of the human reproductive cycle would be mapped and translated into test components. At the same time, there are some dominant findings, which lead to classification as a reproductive toxicant (see Fig. 6.5).

Integrated testing strategies (ITS)
ITS are a consequence of REACH (van Leeuwen et al., 2007), which argues for the use of all available information and views use of the definitive animal experiment only as a last resort. However, the ITS suggested for reproductive toxicity (Fig. 6.4) is relatively simple, not really accommodating any alternative methods.
Figure caption (Bremer and Hartung, 2004): The figure presents a breakdown of embryotoxic effects of 74 industrial chemicals, which have been tested according to EU Directive 67/548/EEC B31. Although 34 chemicals demonstrated effects on the offspring, only 2 chemicals were classified as developmental toxicants according to the standards applied by the national competent authorities. However, an analysis of all the developmental toxic effects shows that mainly combined embryotoxic effects were detected, but some chemicals also induce specific effects, such as delayed ossification and other skeletal effects. It is important that the experimental design of in vitro tests be set up in a way that these effects can be detected.
We have earlier suggested using this for a prevalence-based testing strategy, creating tests for an ITS specifically addressing these aspects. This reduces mapping the human reproductive cycle to those elements which really lead to classifications. We called this a "prevalence-driven approach" (Fig. 6.6).
Fig. 6.6: Proposal for a test strategy in order to detect the embryotoxic hazard of chemicals (modified from Bremer and Hartung, 2004). The flow chart demonstrates a proposal for a testing strategy to detect the embryotoxic hazard of chemicals. Three tests based on embryonic stem cells and their differentiated counterparts have been combined. The reliability of the test strategy has to be tested by using selected chemicals with various toxicological pathways. It has to be proven that all toxicological mechanisms will be detected or whether additional systems such as tests for receptor-mediated embryotoxicity must be included. It should be pointed out that such a test strategy should be part of a general testing scheme for toxicological profiling of chemicals. Chemicals with a known cytotoxic effect probably will not enter into this testing scheme. Chemicals that are known to be metabolized will be tested in combination with a biotransformation system. In vitro tests, in grey, have been developed, but further test optimization and validation is required.
This concept was further refined by Bremer et al. (2007b), who studied in more detail available information on endpoints leading to classifications in various databases. Despite a number of exhaustive database and literature searches, data satisfying the inclusion criteria for this analysis could not be located in the public domain for more than half (53%) of the substances classified by regulators as being toxic to reproduction. The analysis was limited to data on 71 classified reproductive toxicants. Statistically and biologically significant positive effects have been reported as absolute frequency (i.e., the total number of times a positive effect was detected in a particular sub-endpoint, irrespective of the dose at which the effect was seen). The most
A more detailed analysis of the same database was presented by Martin et al. (2009b): "19 highly prevalent effects identified treatment-related changes to reproductive performance including fertility, mating, gestational interval, implantations, litter size, and live birth index, demonstrating effects at different stages of the reproductive cycle. ... The fairly restricted set of 19 effects characterized 151 of the 152 chemicals that demonstrated any reproductive toxicity. Additionally, these 19 effects identified 229 of the 269 chemicals that caused any offspring toxicity. The remaining 40 chemicals not identified were predominantly affecting pup weight only. This supports the hypothesis that we can extract a small finite set of key reproductive effects from this dataset for use in developing robust predictive signatures." This strongly supports the idea that a rather limited set of critical endpoints might be mapped by either mode of action or PoT-based tests. It is hoped that with the expansion of the ToxCast program further chemical classes will be entered. The ontology of effects developed here represents on its own a very valuable tool for the field. Similarly, proprietary data could be analyzed, even in a blinded manner, to establish more robust frequencies, and companies should be encouraged to share these. The analyses of metabolites and biological pathways are likely to identify additional nodes that may be important to develop specific tests for predictive reproductive toxicity, and the PoT approach is therefore important for ITS.
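The coverage figures quoted above can be checked with simple arithmetic, and the idea of extracting a small set of key effects can be illustrated with a greedy selection over a chemical-by-effect table. The following is a minimal Python sketch under stated assumptions: the function names and the four-chemical toy table are hypothetical, and greedy selection is used only for illustration; it is not claimed to be the procedure used by Martin et al. (2009b).

```python
def coverage(identified: int, total: int) -> float:
    """Percentage of chemicals captured by a given set of effects."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * identified / total


def greedy_effect_selection(effects_by_chemical, max_effects):
    """Repeatedly pick the effect that flags the most not-yet-covered chemicals.

    Returns the chosen effects (in order) and the chemicals left uncovered.
    Ties are broken alphabetically so the result is deterministic.
    """
    uncovered = {chem for chem, effects in effects_by_chemical.items() if effects}
    chosen = []
    while uncovered and len(chosen) < max_effects:
        counts = {}
        for chem in uncovered:
            for effect in effects_by_chemical[chem]:
                counts[effect] = counts.get(effect, 0) + 1
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {c for c in uncovered if best not in effects_by_chemical[c]}
    return chosen, uncovered


# Figures quoted in the text: 151 of 152 and 229 of 269 chemicals.
reproductive_pct = coverage(151, 152)
offspring_pct = coverage(229, 269)
print(f"reproductive toxicity coverage: {reproductive_pct:.1f}%")
print(f"offspring toxicity coverage: {offspring_pct:.1f}%")

# Hypothetical toy table (NOT real data): chemical -> observed effects.
toy_effects = {
    "chem_A": {"fertility", "litter size"},
    "chem_B": {"fertility"},
    "chem_C": {"litter size", "live birth index"},
    "chem_D": {"pup weight"},
}
chosen, missed = greedy_effect_selection(toy_effects, max_effects=2)
print(chosen, missed)
```

In the toy table, two effects cover three of the four chemicals and the one left over shows a pup-weight effect only, which mirrors the pattern described in the quotation without reproducing any real data.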
Ideally, not only hazard information is used for an ITS. ITS do not necessarily use only new (in vitro) test data but can incorporate in silico estimations and modeling. A promising integration of different information sources is the combination of in vitro studies with kinetic modeling (Andersen et al., 2005), which has been suggested as Biologically Based Dose-Response Modeling for developmental toxicity (Lau et al., 2000). Note also that existing data can be used, ranging from cell-based tests to animal and human data. To the extent that existing information shall be integrated into the ITS, it will be necessary to assess its quality. A tool was developed (Schneider et al., 2009) to objectively assign the so-called Klimisch scores to either in vivo or in vitro studies as a first step.
Principles for systematic ITS composition (Jaworska and Hoffmann, 2010) and validation (Kinsner-Ovaskainen et al., 2009) are only emerging. A certain consensus exists that the reproducibility of each ITS component needs to be assessed. However, it is not clear how the predictive value can be assessed without an enormous number of substances tested and in the absence of an animal model as point of reference for the components. An evaluation stressing more the scientific validity of the components appears to be a pragmatic solution. Earlier, we termed this a mechanistic validation (Coecke et al., 2007; Hartung, 2007b, 2010b), as it confirms that the model reflects a scientifically established relevant mechanism, differentiating it from an empiric reproduction of a reference test. The careful selection of reference compounds (Hoffmann et al., 2008) will become even more important.
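One way to reason about how evidence from several ITS components might be combined is sequential Bayesian updating of a hazard probability. The sketch below is an illustration under loud assumptions, not a method described in the cited work: the function name, the prior, and the sensitivity and specificity values are hypothetical, and the components are treated as conditionally independent, which real assays may not be.

```python
def update_hazard_probability(prior, sensitivity, specificity, positive):
    """Apply Bayes' rule for one binary test result.

    prior        -- probability that the substance is a toxicant before the test
    sensitivity  -- P(test positive | toxicant)
    specificity  -- P(test negative | non-toxicant)
    positive     -- observed test outcome (True = positive)
    """
    for name, value in (("prior", prior), ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    if positive:
        likelihood_toxic, likelihood_nontoxic = sensitivity, 1.0 - specificity
    else:
        likelihood_toxic, likelihood_nontoxic = 1.0 - sensitivity, specificity
    numerator = likelihood_toxic * prior
    denominator = numerator + likelihood_nontoxic * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("observed result is impossible under these parameters")
    return numerator / denominator


# Hypothetical ITS with two components (all numbers are illustrative only).
probability = 0.10  # assumed prior prevalence of reproductive toxicants
components = [
    {"sensitivity": 0.80, "specificity": 0.90, "positive": True},
    {"sensitivity": 0.70, "specificity": 0.80, "positive": False},
]
for component in components:
    probability = update_hazard_probability(probability, **component)
    print(f"updated hazard probability: {probability:.3f}")
```

The design choice here is deliberate simplicity: each component only needs an estimate of its own sensitivity and specificity, which ties in with the point that the reproducibility and scientific validity of each component must be assessed individually.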

Pathways of Toxicity (PoT) and systems toxicology
The area of reproductive toxicity testing appears to be very well suited for PoT-based approaches as currently pioneered by the
frequent ones were 39 cases of body weight changes as a more general toxicity parameter, 30 cases of testicular weight/histopathology, 28 offspring body weight at birth, and 25 each for sperm morphology, sperm count, pregnancy rate, and live offspring. Interestingly, uterine weight/histopathology was on the lower end with only five cases. Most of the reported effects are not isolated, but also appear in combinations. It is, therefore, highly relevant for a further analysis, in particular for sub-endpoints occurring with a lower prevalence, to determine if they are associated with a more frequently occurring effect. Such analysis could allow focusing test development on the most relevant modes of action. This would further diminish the relevance of testing for a sub-endpoint with a lower prevalence such as, e.g., parturition. Even if parturition is a sub-endpoint with a low prevalence of a health effect, which has per se a low prevalence in the universe of industrial chemicals, the competent authorities currently request testing for such an endpoint.
For developmental toxicity the search strategy described above identified reliable data for 202 of the classified substances. Given the extensive range of histopathological, functional, clinical, and other evaluations undertaken in the context of a developmental toxicity study, standardization is important not only in relation to the selection of study endpoints but also in the terminology used to communicate study results. For the purposes of this analysis, studies were analyzed and catalogued in a manner consistent with the recommendations of Chahoud and colleagues (Chahoud et al., 1999) using sub-endpoint definitions proposed by MacKenzie and Hoar (Derelanko and Hollinger, 2001). The frequency with which standardized sub-endpoints from guideline prenatal developmental toxicity and developmental toxicity studies were reported positive for the 202 substances in this database ranged from 78 for postimplantation and dead, 77 skeletal, 60 body weight, 55 external limbs and digits, etc. Offspring sex ratio (4) and parturition (2) were the least frequent. These preliminary analyses illustrate how we might be guided in developing an ITS of components most relevant for regulatory decision making.
The limited availability of full study records in the public domain impedes this approach, but the more recent data made available, for example via the ToxRefDB, might help here: Knudsen et al. (2009) characterized 283 chemicals (mainly pesticides) tested in both rats and rabbits; 53 chemicals (18.7%) had lowest effect levels on development that were either specific (no maternal toxicity) or more sensitive than the maternal animal in either species: "The primary expressions of developmental toxicity in pregnant rats were fetal weight reduction, skeletal variations and abnormalities, and fetal urogenital defects. General pregnancy/fetal losses were over-represented in the rabbit, as were structural malformations to the visceral body wall and CNS. Based upon administered doses, there was a clear hierarchy to the sensitivity and specificity of [developmental lowest effect levels] dLELs in comparing species, with rat development being more sensitive with regards to the number of endpoints affected and the number of active chemicals. Many of these relationships are consistent with previous database studies of developmental toxicology, indicating that they are driven by the biology of the test species." Similar efforts should be extended for the (preferably human) EST (Kleinstreuer et al., 2011; West et al., 2010). The obvious potential of combining stem cell methods with Tox-21c approaches was stressed earlier (Chapin and Stedman, 2009). This extends from embryotoxicity to other areas such as male fertility (Krtolica and Giritharan, 2010) and to linking toxicity-related biomarkers uncovered using hES cells with the PoT concept. However, the vision of Tox-21c is not that a tremendous number of assays in a centralized facility are used for each and every substance. Once critical PoT are identified, they can be translated into rather simple assays.
For developmental toxicity, for example, a total of 17 intracellular pathways have been identified as involved in organogenesis (for review see: Anon., 2000), cyto-differentiation, growth and tissue renewal, of which 5 appear to be the most relevant for early development (Fig. 6.7).
As part of a collaborative project linked to the ReProTect project, Michael Schwarz and his group in Tübingen, Germany, have established a system with ECVAM in which mouse embryonic stem cells were stably transfected with luciferase reporters specific for the Wnt/beta-Catenin and the TGF-β signaling pathways (the so-called ReproGlo assay; Uibel et al., 2010). The effects of several known human teratogens and non-teratogens, including thalidomide, have been investigated
US-EPA: The ToxCast project has mapped a multitude of pathway assays to animal reference data: Sipes et al. (2011a) have delivered a very impressive proof of principle of the PoT concept across species. This has been started for zebrafish (Sipes et al., 2011b), demonstrating the common basis of PoT with mammals. The two most promising alternatives for hazard-based identification of developmental activity in the ToxCast battery are non-animal embryonic stem cells and zebrafish embryos, although the latter should only be seen as an interim approach until full replacement tests are available. These models, in tandem with >600 ToxCast assays, provide a unique resource for this prioritization. The performance of these test systems needs to be looked at closely within the context of ToxRefDB animal bioassay data. Early results comparing zebrafish with pregnant rat and rabbit have shown similar concordance (e.g., ~56-60%) between rat-zebrafish, rabbit-zebrafish, and rat-rabbit. As such, there is a need to develop in vitro extrapolation from concentration response to in vivo dosimetry and to address cross-species differences and life-stage assessments. Although most HTS assays were based on human cells, they could distinguish PoT that are active in either rats or rabbits, explaining species differences. This shows how the change in resolution allows annotating PoT to different species and measuring them with PoT-specific assays with high throughput.
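To make the notion of pairwise concordance between test systems concrete, the following minimal Python sketch computes the percentage of chemicals on which two systems give the same binary call. It is an illustration only: the function name and the five-chemical toy calls are hypothetical, plain percent agreement is assumed as the concordance measure, and the source does not state how the ~56-60% figures were calculated.

```python
def percent_concordance(calls_a, calls_b):
    """Percentage of shared chemicals with identical binary calls.

    calls_a, calls_b -- dicts mapping chemical name to True (active)
                        or False (inactive) in the respective test system.
    Only chemicals tested in both systems are compared.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("no chemicals were tested in both systems")
    agreements = sum(calls_a[chem] == calls_b[chem] for chem in shared)
    return 100.0 * agreements / len(shared)


# Hypothetical toy calls (NOT real ToxCast or ToxRefDB data).
rat_calls = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True}
zebrafish_calls = {"c1": True, "c2": False, "c3": False, "c4": True, "c5": True}

print(f"rat-zebrafish concordance: "
      f"{percent_concordance(rat_calls, zebrafish_calls):.0f}%")
```

Percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic might be preferable when the proportion of active chemicals is far from 50%.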

Dosing: Steady state versus Cmax
It is not clear what would be the best approach to determining concentrations of compounds to be used in in vitro studies. Current expert opinion is that cytotoxicity information is not valuable, and that triggering of pathways in reproductive and developmental toxicity should be evaluated. Frequently, dose-response analyses for toxicity are performed and subtoxic or low doses relative to cytotoxicity are chosen. Given that the in vitro models are very different from in vivo, comparisons between these two systems are useful to assure relevance. A related concern is whether steady state concentrations in vivo are the reference or whether Cmax values might be more appropriate. Many metabolic specialists claim that high doses of compounds trigger later cellular events, which are not likely to be seen at longer-term steady state levels. These facts should be taken into consideration when developing or refining in vitro tests.

Short versus long-term effects
Evidence should be built into in vitro predictive models that allow for an understanding of the early identification of events that may take longer timeframes to be observed in vivo. If exposure to a compound in vivo takes weeks or months to produce an event, do the in vitro tests performed for shorter periods of time have the potential to identify these compounds?

Biological systems
The main interest lies in developing biological systems that model in vivo systems. Obviously, the key issue is that the systems must be relevant (and most likely based on human rather than animal materials). The more complicated systems typically move toward systems with cellular interactions and ultimately toward 3D cultures. The caution is that the compound doses delivered to the cells in these experiments need to be carefully considered. The more complex systems typically evolve to polarized cells, and it is becoming very clear that cells interact very differently with compounds delivered apically versus basolaterally (Benet et al., 2003). Therefore, the addition of compounds to such models needs to be evaluated with relevance to the in vivo situation. Both the advantages and disadvantages of 3D and complex models need to be considered.

In vivo factors
Many upstream risk factors are associated with human developmental defects as an interaction of multiple factors relating to genetics, environment, and socioeconomic status. The latter includes factors such as prenatal healthcare, maternal nutrition, anxiety, general health, and drug use/abuse. These may be difficult to unravel in vivo (adverse outcome pathways) and to quantify in vitro (toxicity pathways). As such, alternative methods need to address key molecular pathways and cellular processes that propagate information across multiple scales of biological organization in the developing embryo. Particularly important, but as yet under-represented in alternative models, is a systematic approach to characterize and analyze multicellular networks within the context of normal biological architecture. Assays that address 3D configuration and extracellular matrix biology are needed.
in this system. The undifferentiated cells are incubated for only 24 hours; the system is based on a multi-well format and thus is well suited for high-throughput analysis. It also allows the determination of non-specific toxicity (Alamar Blue assay) and the specific response (luciferase-reporter readout) on one and the same plate. The test correctly identified human reproductive toxicants such as lithium chloride, retinoic acid, the potency of different valproic acid derivatives and (with a metabolizing system) cyclophosphamide. This nicely illustrates that PoT-based assays, if representing nodes in the perturbed physiological networks, most likely can cover substantial parts of the universe of toxicants.

General considerations

Machine learning and 'omics'
Considering the fact that we do not know everything, and our current knowledge base is growing quickly, the identification of pathways and nodes of biology that seem to be state-of-the-art today might be outdated tomorrow. If we develop specific assays focused only on what we know, we limit our ability to uncover new mechanisms based on new compounds, mixtures, or metabolites. An example of this, even within a platform, can be seen in metabolomics. When one uses only a targeted approach, the ability to learn about new biomarkers is limited. In contrast, an untargeted approach opens the number of possible endpoints to all of the measurable metabolites. Furthermore, tests that rely on vast amounts of biomarker data, such as those obtained from metabolomic fingerprints, allow a continual "machine learning" approach of the predictive models. Omics-based approaches seem to be an especially efficient way to examine many biomarkers simultaneously and to create knowledge bases that will allow a machine-learning approach to ensure inclusion of important information.

Compound-related issues
The number of compounds with clearly known human reproductive or developmental adverse effects is relatively small; therefore it becomes difficult to compile a relevant set of compounds on which to build predictive in vitro model systems. Furthermore, assuming the efforts to predict human reproductive or developmental effects are successful, compounds that cause human reproductive toxicity will not go forward, and the set of compounds to be used as reference molecules will become self-limited. Predictive training sets need to be standardized, and to really understand the utility of training sets it is recommended that similar structures are used, especially when they segregate differently (toxic vs. nontoxic). Otherwise, efforts should be made to ensure that structures are as diverse as possible to maximize the "chemical and biological space" of the predictive models. Lastly, it would be interesting to select some compounds to be predicted from the PoT key metabolites and pathway regulators. In conclusion, it is essential to have a set of compounds compiled and recommended for use as a gold standard training set.

Conclusions and recommendations: reproductive toxicity
It is probably too simplistic just to break developmental and reproductive toxicity down into a series of hazards. Issues that should be considered or addressed in developmental toxicity testing were recently listed by Makris et al. (2011):
- Translational medicine, cross-species extrapolation
- Mode of action data
- Cumulative exposure issues
- Critical windows of exposure and effect
- Latency of response
- Structural vs. functional outcomes
This chapter very much agrees with the emphasis on mode of action, or even with finer resolution to PoT. This corresponds very well with an emphasis on functional instead of structural outcomes. It is hoped that the annotation of PoT to species will help the cross-species extrapolation. It is also hoped that the early events (points of chemical interaction) will also be predictive for the more latent manifestations. Exposure considerations have not been addressed here, with the exception of the TTC concept.
There are many considerations involved in non-animal testing for stages of the reproductive cycle, and an integrated strategy combining in vitro methods with high-throughput screening (HTS), predictive computational models, and computer simulation provides the foreseeable path forward.

Recommendations: reproductive toxicity
The following key recommendations are made:
1. The limitations of the pertinent animal test protocols for reproductive toxicity testing should be systematically reviewed in the spirit of evidence-based toxicology.
2. The TTC approaches can be further refined by distinguishing classes of chemicals or using internal TTC, i.e., basing the threshold on plasma concentrations actually achieved. An interesting option would be to use experimental barrier model data to modify the TTC level.
3. The zebrafish embryo teratogenicity assay should be evaluated for defining a protocol that will allow formal validation, although the test should be seen as an interim approach until a full animal-free replacement is available.
4. A human stem cell-based test employing either human embryonic stem cells or induced pluripotent stem cells should be validated. An evaluation of stem cell variants and prediction models should be carried out, especially since the assay is considered for possibly replacing the second species in reproductive toxicity testing by FDA. Similarly, the variants of whole embryo culture should be followed up almost a decade after validation of the original protocol.
5. The advantage of using human rather than animal derived biological test systems should be taken into account for every optimization or new development of a test system that is designed for human risk assessment.
6. For in silico approaches, the ILSI/HESI Working Group recommendations are reiterated: "1) A systematic and holistic analysis of developmental toxicity data of adequate quality and quantity should be conducted. Toward this aim, a comprehensive, publicly available electronic database of developmental toxicity data is needed. Such a compilation would allow investigators to methodically examine a range of considerations when selecting and utilizing toxicological data in training sets (i.e., various experimental factors, various approaches to combining/separating categories of endpoints, and alternative scoring systems); 2) Training sets for discrete developmental endpoints should be developed. This would allow examination of the process used to assemble training sets, as well as the effect of alternative processes on the predictive performance of the model. The Working Group considers these first two recommended efforts, compiling/analyzing a comprehensive database and developing/investigating alternative training sets, as complementary and iterative exercises; 3) The combined use of multiple types of tools and approaches for screening should be investigated."
7. Typical alerts leading to reproductive toxicity testing from repeated dose studies or developmental toxicity testing studies should be identified in order to develop mechanistic in vitro tests to clarify the alert.
8. The analysis of findings in reproductive toxicity studies leading to classifications should be consolidated to identify modes of action to translate into test modules for an ITS.
9. ITS as Bayesian networks of mode of action tests should be formed and optimized by machine learning.
10. PoT from the most promising in vitro tests (stem cells, zebrafish, whole embryo culture) should be mapped to feed into a Human Toxome database. Similarly, analysis of samples from animal experiments might allow PoT identification using omics approaches.
11. Identified PoT should lead to specific test development, preferably HTS compatible.
12. A probabilistic risk assessment condensing the information from PoT-based tests and other sources needs to be developed.

Overall Conclusions
Alternative approaches as one-by-one replacements of animal tests have advanced over the last two decades, and formal validation has delivered the proof-of-principle that they do not lower safety standards (Westmoreland et al., 2010). Increasingly, international acceptance of these methods is being achieved. However, currently validated tests address mainly topical and acute toxicities. The advances in technologies and the gain of toxicological knowledge also appear to make novel approaches feasible for systemic toxicities in a not-so-distant future. Based on the recent analysis commissioned by the European Commission (Adler et al., 2011) and its independent review (Hartung et al., 2011), this expert group has started to set priorities and to identify a roadmap for such a transition. The five whitepapers prepared for this purpose differ in approach and style, even after discussion, revision, compilation, and editing. The framework for a strategy to replace animal tests has only been developed during the writing of the whitepapers. It has been applied to carcinogenicity (Chapter 5) and reproductive toxicity testing (Chapter 6), but not to the other three fields.
One reason for the differences between the chapters is the different status of the areas. Reproductive toxicity testing has been pioneered by the ReProTect and the ToxCast projects, and the modes of action for genotoxic and non-genotoxic carcinogenicity appear to be more limited in number and better understood than for chronic organ toxicities. Those two areas form a group, together with repeated dose toxicity testing (Chapter 4), as all three areas are suitable for ITS and PoT-based approaches, which represent not only a departure from one-to-one replacement strategies but also a revolution of testing strategies brought about during the last decade. It appears that the large number of target tissues and modes of action will make an ITS approach difficult, requiring that these be broken down to PoT and PoT-based assays, which can then be combined in a HTS platform.
The situation for toxicokinetics (Chapter 2) and skin sensitization (Chapter 3) appears to be very different from the three areas above: Toxicokinetics has to be seen more as the necessary complement to all novel approaches. New assays in this field will be used to enable quantitative in vitro/in vivo extrapolation (QIVIVE); here, the main approaches are in silico modeling and the integration of input from in vitro barrier and metabolism models. With a targeted effort, especially broadening the database for modeling, and the necessary funding, an important contribution could be expected in a few years, in line with the earlier reports' judgment (Adler et al., 2011). Skin sensitization has seen the development of about 20 in vitro and in silico models, several of which look very promising and are currently undergoing validation. We will have to see whether and how to combine these tests in the most meaningful way.
Eventually, ITS will be set up that reflect the different modes of action and steps in the pathophysiology of skin sensitization.
The main conclusions and recommendations of the report can be summarized as follows:

Toxicokinetics
- Represents a necessary complement to all in vitro approaches to allow QIVIVE
- Need for "in vitro kinetics" of chemicals in the experimental systems with the goal of producing proper kinetic parameters for QIVIVE
- In silico approaches need to be further optimized
- Need for more comprehensive data collections, especially in vitro data from barrier models
- Problems mainly in the fields of bioavailability and urinary excretion
- Achievable with reasonable investment

Sensitization
- Reasonably good animal model (LLNA) capable of generating potency and dose response information
- Multiple in vitro assays available but unclear which test methods provide potency information
- The need to build mechanistic understanding to enable data integration for potency determination for hazard characterization & risk assessment remains an important in vitro challenge

Repeated dose testing
- Tox-21c approaches based on PoT represent the key perspective; need to focus on defining levels that cause adverse effects rather than just hazard identification
- Need for data sharing from industry
- Need for models for PoT identification (e.g., stem cells)
- Need for co-cultures, 3D models, and long-term models
- Human disease knowledge and known toxicants must be exploited
- Increased focus on modeling of inflammatory/immunological damage

Carcinogenicity
- Possible abolition of current test via an objective assessment with tools of Evidence-based Toxicology (EBT)
- Important ongoing work to optimize genetic toxicity battery
- Further evaluation of cell transformation assay required
- ITS including non-genotoxic modes of action should be developed
- Tox-21c approaches based on PoT (including metabolomics) represent a key opportunity

Reproductive Toxicity
- Analysis of current animal tests by EBT approaches
- Validation of (human) embryonic stem cell test variants
- Validation of zebrafish egg test for teratogenicity
- Extension of ITS approaches, extending the approach of ReProTect
- Extension of the ToxCast program currently pioneering PoT-based assessments
- Tox-21c approaches based on PoT, especially mapping the PoT for reproductive toxicity for a Human Toxome database