Consensus Report on the Future of Animal-Free Systemic Toxicity Testing

Marcel Leist 1,2, Nina Hasiwa 1, Costanza Rovida 1, Mardas Daneshian 1, David Basketter 3, Ian Kimber 4, Harvey Clewell 5, Tilman Gocht 6, Alan Goldberg 7, Francois Busquet 1, Anna-Maria Rossi 1, Michael Schwarz 6, Martin Stephens 7, Rob Taalman 8, Thomas B. Knudsen 9, James McKim 10, Georgina Harris 7, David Pamies 7 and Thomas Hartung 1,7
1 Center for Alternatives to Animal Testing, CAAT-Europe, University of Konstanz, Konstanz, Germany; 2 Doerenkamp-Zbinden Chair of In Vitro Toxicology and Biomedicine, University of Konstanz, Konstanz, Germany; 3 DABMEB Consultancy Ltd, Sharnbrook, UK; 4 Faculty of Life Sciences, University of Manchester, Manchester, UK; 5 The Hamner Institutes for Health Sciences, Research Triangle Park, NC, USA; 6 Eberhardt Karls University, Tübingen, Germany; 7 Center for Alternatives to Animal Testing, CAAT, Johns Hopkins University, Baltimore, US; 8 Cosmetics Europe, Brussels, Belgium; 9 US EPA, Research Triangle Park, NC, USA; 10 CeeTox, Kalamazoo, MI, USA


Summary
Since March 2013, animal use for cosmetics testing for the European market has been banned. This requires a renewed view on risk assessment in this field. However, in other fields as well, traditional animal experimentation does not always satisfy requirements in safety testing, as the need for human-relevant information is ever increasing. A general strategy for animal-free test approaches was outlined by the US National Research Council's vision document for Toxicity Testing in the 21st Century in 2007. It is now possible to provide a more defined roadmap on how to implement this vision for the four principal areas of systemic toxicity evaluation: repeat dose organ toxicity, carcinogenicity, reproductive toxicity and allergy induction (skin sensitization), as well as for the evaluation of toxicant metabolism (toxicokinetics) (Fig. 1). CAAT-Europe assembled experts from Europe, America and Asia to design a scientific roadmap for future risk assessment approaches and the outcome was then further discussed and refined in two consensus meetings with over 200 stakeholders. The key recommendations include: focusing on improving existing methods rather than favoring de novo design; combining hazard testing with toxicokinetics predictions; developing integrated test strategies; incorporating new high content endpoints to classical assays; evolving test validation procedures; promoting collaboration and data-sharing of different industrial sectors; integrating new disciplines, such as systems biology and high throughput screening; and involving regulators early on in the test development process. A focus on data quality, combined with increased attention to the scientific background of a test method, will be important drivers. Information from each test system should be mapped along adverse outcome pathways. Finally, quantitative information on all factors and key events will be fed into systems biology models that allow a probabilistic risk assessment with flexible adaptation to exposure scenarios and individual risk factors.

Keywords: safety testing, animal-free testing, systemic toxicity, adverse outcome pathways

Fig. 1: The five areas of systemic toxicity testing

Introduction and background
The discussion leading to this summary report started with the 7th Amendment to the Cosmetics Directive 76/768/EEC¹, which called for a complete ban of testing on vertebrate animals for the toxicological characterization of cosmetics ingredients in 2013. The European Commission asked experts to evaluate the availability of alternative non-animal methods. Their conclusion that alternative methods would not be available during the next 10 years (Adler et al., 2011) met with some criticism (Taylor et al., 2011), but was mostly endorsed by an independent expert group invited by CAAT also including specialists from Japan and the USA. They also noted that significant advances had been made in the time between the publication of the Adler Report and the evaluation by the international group of experts. The next step, i.e., assembling experts to create a perspective for the future, was initiated by CAAT-Europe in a series of commissioned white papers on sensitization, repeated dose organ toxicity, toxicokinetics, carcinogenicity and reproductive toxicity. Importantly, this work addressed a broad range of chemical testing, including also the fields of drugs, pesticides and industrial chemicals in addition to cosmetics ingredients. A workshop with 35 experts discussed these white papers. This activity resulted in the extensive report A roadmap for the development of alternative (non-animal) methods for systemic toxicity testing (Basketter et al., 2012). To involve all potential stakeholders, this roadmap was presented in March 2012 in Brussels² by several experts in front of about 200 stakeholders from governmental organizations, academia, industry and NGOs from all over the world (Fig. 2). A second workshop, Scientific roadmap for the future of animal-free systemic toxicity testing, similar in size and scope, was organized in Washington at the FDA in May 2013 to give updates on the Basketter report and scientific advances in the fields³.
Each of the lectures on the five major fields still requiring better non-animal safety testing methods (Fig. 1) was followed by one hour of intensive discussion to consolidate or improve the suggested strategies. Here we report the final outcome. This roadmap is expected to pave the way for a new toxicology that can better predict the effect of chemicals on humans, using fewer or even no vertebrate animals.

¹ http://ec.europa.eu/consumers/sectors/cosmetics/files/doc/antest/(2)_executive_summary_en.pdf

Fig. 2: Timeline of events leading to this public expert consultation report
The 7th amendment of the European Cosmetics Directive required the phasing out of animal testing to be completed by 2013. The European Commission evaluated the availability of non-animal methods and the outcome was published (Adler et al., 2011). The conclusions of that report were confirmed by independent experts. In order to outline a roadmap for further development of non-animal methods for addressing systemic toxicity, an expert consortium was convened in a series of CAAT-Europe workshops to elaborate on the issue. As a result, an extensive report on the roadmap for non-animal methods for systemic toxicity testing was published (Basketter et al., 2012) and presented for a public expert consultation in 2012 in Brussels and in 2013 in Washington at an FDA-hosted event. The present report summarizes the recommendations resulting from the public expert consultation in which over 200 experts from academia, industry and regulatory authorities were involved. Grey boxes refer to actions taken by CAAT/CAAT-Europe.

Animal models
A successful strategy to replace animal testing must take its starting point from the analysis of the current technology. The weaknesses of animal testing could then be avoided by the new approach (Fig. 3). It must be noted that rational comparisons are made difficult by the fact that almost all of the currently used animal models have never been formally validated. The rationale of their use is therefore not based on scientific data (Hartung and Leist, 2008).
Some of the problems related to animal models derive from the high doses that are tested and the multiplicity of endpoints that are measured. Experts claim that "Half of all chemicals, whether natural or synthetic, are positive in high-dose rodent cancer tests. These results are unlikely to be relevant at the low doses of human exposure." (Ames and Gold, 2000). Many of the attendees expressed the opinion that widespread knowledge on the limited value of certain animal studies frequently contributes to the decision by authorities to waive testing.
The lack of predictivity of animal models is particularly apparent from the field of drug development (Leist and Hartung, 2013). Only 8% of drugs entering clinical phase I (first human dose) gain approval by authorities and half of them fail in phase III⁴. US Health and Human Services Secretary Mike Leavitt commented that "currently, nine out of ten experimental drugs fail in clinical studies because we cannot accurately predict how they will behave in people based on laboratory and animal studies."⁵ The conclusions are easily drawn: We need human-predictive, rapid and economical methods to evaluate whether or not a compound, no matter if chemical, drug or cosmetic ingredient, is safe for intended human use.

Fig. 3: Problems with animal experiments
To assess the hazard posed by substances humans are exposed to, all available approaches need to be evaluated for their usefulness. The present system of animal testing needs critical evaluation of its predictive power for human safety. The limitations of animal testing, which is often considered the "gold" standard, may compromise human safety and pose an economic threat. Under such conditions, its ethical acceptability is also doubtful.

Future safety science and pathways of toxicity
A key step in the paradigm shift in toxicology, as far as regulatory authorities are concerned, was the 2007 US National Research Council report Toxicity Testing in the 21st Century: A Vision and a Strategy (NRC, 2007; Leist et al., 2008a). It promoted the idea that the number of ways that a chemical or drug could disturb a cell is finite and can therefore be identified by appropriate screening methods. Quantitative information on the concentration-dependence of such disturbances can be used to predict the overall network of cellular regulatory reactions (Blaauboer et al., 2012). Deviations from normal at important control points could be related to adverse effects of chemicals and have been termed pathways of toxicity (PoT). Individual susceptibilities to toxicant actions are determined by genetic heterogeneity of the human population (G), but also by additional environmental factors (E) (Fig. 4). The combination of high throughput screening assays with traditional cellular assays has been supplemented by in vitro-in vivo mathematical extrapolations, systems biology (computer models of cell regulation) and other approaches by many leading academic and governmental organizations to provide integrated testing strategies (Leist et al., 2012a,b; Sturla et al., 2014; Hartung et al., 2013b; Kavlock et al., 2012; Andersen et al., 2011; Hartung and McBride, 2011; Bouhifd et al., 2014; Rossini and Hartung, 2012).
New approaches to safety testing require new strategies for stringent but flexible evaluation of the suitability and performance of methods. Methods suggested by the Evidence-based Toxicology Collaboration (European and US branches, http://www.ebtox.com) will be helpful in this process (Stephens et al., 2013). The risk classification itself is also likely to undergo fundamental changes. At present, using a very limited number of animal tests, a chemical is classified as toxic or non-toxic (deterministic risk assessment) at a given exposure level. The much richer information provided by the new approaches and the progress of safety sciences could form a basis for probabilistic risk assessment (Paparella et al., 2013). This would allow for more rational and science-based regulatory decisions by assembling information from a tailored set of tests adapted to different types of questions and scenarios of exposure and risk assessment (Fig. 5).

Fig. 4
PoT are cellular pathways of metabolism and regulation. Interference with them can lead either to adaptive or adverse (maladaptive) responses. Prediction of the outcome requires computational modeling. A toxicant challenge may trigger different responses at different concentrations, leading to various reactions in the cell. A low target site concentration (corresponding to a "no observed effect level" (NOEL) and being much lower than the "no observed adverse effect level" (NOAEL)) may not affect the normal biological function. A medium concentration (in the range of the NOAEL) may induce an adaptive stress. Whether this results in a return to normal function or to an altered biological state depends on genetic and environmental factors and their interaction (G x E). An even higher concentration (much larger than NOAEL or NOEL) might either lead to an adaptive stress response or a complete loss of function. Here, G x E factors also play a key role in the decision whether a compound leads to cell injury, morbidity and mortality.

General forward strategies
While discussing the five specific toxicological domains, the experts identified issues relevant to all areas of in vitro methods. Joint knowledge management and sharing of expertise between different sectors, stakeholders and application domains were identified as important drivers for accelerated progress, in addition to accelerated test establishment and validation, and better use of computational toxicology methods (Fig. 6).
Sharing of data would hopefully lead to the creation of a human safety database, bigger and more complete than existing ones (e.g., OpenTox, EPA, ToxBank, IMI activities). Most importantly, it would be more accessible and should be tailored for in vitro-in vivo comparisons as well as data mining by nonspecialists in bioinformatics. The focus would not be on data collection as such, but on the accessory information linked to the primary outcome data as far as mechanisms of toxicity are concerned. As there are major hurdles (e.g., intellectual property rights and industrial competitiveness issues) to be overcome, it is clear that substantial incentives must be granted to encourage industry to share their proprietary data (Fig. 7).
Fig. 5: Vision of a smooth transition from current to future toxicology in safety science
It is envisaged that the types of test systems employed will change over the time course of the establishment of a new safety science. At present, complex test systems that are specific for organ functions and developmental stages are preferentially used. Only few programs use simple assays of elementary biochemical and cellular function (e.g., ToxCast™ Program). Over time, more and more critical biomarkers of toxicity may be identified by the application of HCS (high-content screening) and omics technologies to the complex systems, and simple test systems may suffice to measure key processes (Rossini and Hartung, 2012). Case studies, e.g., from PBPK (physiologically based pharmacokinetic modeling) and skin sensitization fields, could be used as learning models for the transition. The principles of evidence-based toxicology and the resulting quality control will lead to an accelerated method development and validation. Over time, the goal is to shift from the present deterministic risk classification to a probabilistic risk assessment.

Fig. 6: General points to consider when moving forward towards new approaches for systemic toxicity testing
While discussing solutions for the five toxicological endpoints (carcinogenicity, reproductive toxicity, repeated dose organ toxicity, sensitization and toxicokinetics), the experts agreed that several suggestions apply equally to all five areas under investigation. These general suggestions are summarized here.

Fig. 7: Creation of a high quality database for relating in vivo and in vitro information
The key players (pharmaceutical, chemical and cosmetics industry, basic research and regulators) share common goals that are of high value to them. The benefit of working together should outweigh disadvantages (opening of proprietary databases). This would allow the generation of a large, high quality database of in vitro toxicity data. It should be publicly available, include rich data that informs on the mode of action of compounds and allows for in vitro-in vivo correlations. It should also be quality controlled and suitable for case studies. This can be achieved by joint projects and the common use of legacy data from hitherto proprietary in-house databases. The collection of human data by micro-dosing, from clinical trials and from epidemiological studies plays a major role.

None of the future challenges in the field of in vitro toxicology can be addressed by individual test systems. The solution will rather lie in the construction of batteries of tests to be combined in integrated testing strategies (ITS). This will need to be considered right from the conception of a test, throughout its development and especially during the evaluation of its performance. The latter evaluation needs to consider the test alone, but also in the context of the added value it brings to a test battery. More research and experimentation is required on how to build ITS. One example of a flexible, yet fully quantitative approach, is the Bayesian network (McDowell and Jaworska, 2002; Jaworska and Hoffmann, 2010; Jaworska et al., 2011), which has been applied successfully in the area of skin sensitization (Fig. 8). Input of the field of machine-learning is envisaged to be very important for optimal strategic designs of ITS (Hartung et al., 2013b).
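The idea that a test battery can add value beyond any single assay can be illustrated with a deliberately simplified calculation. The sketch below is not the Bayesian network of Jaworska and Hoffmann (2010); it is a naive-Bayes stand-in that assumes the assays are conditionally independent, and the sensitivity and specificity values (0.8) and the prior (0.5) are hypothetical placeholders rather than measured performance figures. The function names are likewise illustrative.

```python
from math import comb


def posterior_positive(prior, results, sensitivity, specificity):
    """Naive-Bayes update of P(compound is a sensitizer).

    prior       -- prior probability of a true positive (assumed value)
    results     -- list of booleans, one assay call per entry
    sensitivity -- P(assay positive | sensitizer), assumed equal for all assays
    specificity -- P(assay negative | non-sensitizer), assumed equal for all assays
    """
    odds = prior / (1.0 - prior)
    for positive in results:
        if positive:
            # likelihood ratio of a positive call
            odds *= sensitivity / (1.0 - specificity)
        else:
            # likelihood ratio of a negative call
            odds *= (1.0 - sensitivity) / specificity
    return odds / (1.0 + odds)


def majority_vote_accuracy(n_assays, accuracy):
    """Accuracy of a simple majority vote over independent assays."""
    needed = n_assays // 2 + 1
    return sum(
        comb(n_assays, k) * accuracy**k * (1.0 - accuracy) ** (n_assays - k)
        for k in range(needed, n_assays + 1)
    )


if __name__ == "__main__":
    # Three concordant positive calls from assays with 80% sensitivity/specificity
    print(round(posterior_positive(0.5, [True, True, True], 0.8, 0.8), 3))
    # Majority vote over three independent assays, each 80% accurate
    print(round(majority_vote_accuracy(3, 0.8), 3))
```

Under these assumptions, three independent assays that are each 80% accurate reach roughly 90% accuracy by majority vote. Real assays are correlated, so the gain from a real ITS depends on how much independent information each test contributes, which is exactly what a full Bayesian network is meant to quantify.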

Strategies to improve test systems
Alternative in vitro methods have been developed for all toxicological questions, including even the most complex fields, ranging from developmental neurotoxicity (DNT) to xenobiotic metabolism (Adler et al., 2011; Basketter et al., 2012; Leist et al., 2012a, 2008b; van Thriel et al., 2012; Smirnova et al., 2014; Taylor et al., 2011). Many of these methods should be formally validated for immediate use, or they could form the basis for accelerated further development. Optimization of existing systems is an important part of the strategy to accelerate the implementation of a mostly animal-free safety science, in addition to the more time demanding development of entirely new methods (Fig. 9). One specific way to improve available tests is the incorporation of highly information-rich endpoints provided by omics technologies. Where classical methods measure only one, or few, endpoints (e.g., metabolites or gene expression levels), the new approaches can yield thousands of data points simultaneously, and provide information on a genome-wide scale (Fig. 10) and, thus, allow insights into the reaction of a network.

Fig. 8: Example for the use of Bayesian networks in the establishment of integrated testing strategies
LLNA (local lymph node assay) potency prediction is used here as an example from the area of skin sensitization. Information from different assays (circles) is fed into the network. The dimension of the circles represents mutual information values; the length of the arrows has no mathematical correlation. For instance, information can be obtained on how the in vivo outcome (LLNA: local lymph node assay) is predicted by physicochemical compound properties (such as molecular weight or lipophilicity (Kow)), biological assays (e.g., GARD assay or dendritic cell assay) and peptide reactivity measures (DPRA: direct peptide reactivity assay). The advantage of the approach is that it can be coupled to other networks or other assays, as they are desired and become available. The original paper (Jaworska and Hoffmann, 2010) contains all the details on the background.

Fig. 9: Strategies to improve in vitro test systems
Test systems that already have been developed can still improve in quality and robustness to arrive relatively quickly at predictive test systems fit for regulatory use. A list of features to be considered has been compiled here.

Fig. 10: Overview of different omics technologies that can inform on chemicals' adverse outcome pathways and underlying modes of action
Omics technologies provide data-rich endpoints. The biological information flow in a cell leads from gene sequences (the code) via RNA (the messages) to enzymes and other functional proteins (the tools). Within this infrastructure small molecule metabolites may be regarded as the goods that are produced and traded. They comprise energy substrates, building blocks and signaling messengers. As there are feedback loops between all levels, the different omics technologies address these four organization levels. The disturbance of a cell by chemicals may be measured by any single technique. Combinations of more than one approach lead to a better prediction of the true human situation.

Specific approaches for the five toxicological endpoints still lacking validated replacement methods
A detailed strategy has been elaborated and described for each endpoint (Basketter et al., 2012). The consensus meeting of the roadmap initiative highlighted specific points for immediate attention and action:

Skin Sensitization
Although the sensitization process is a disease-free state, subsequent exposures can lead to allergic contact dermatitis, the most common adverse effect of chemicals on human health. One in 5 adults suffers contact allergy to one chemical or another (Peiser et al., 2012). This area differs from others as a large amount of human data is available. Moreover, the general mechanisms of sensitization are well-defined. There are three well-developed animal models (Buehler Guinea Pig Test, Guinea Pig Maximization Test (GPMT) and Mouse Local Lymph Node Assay (LLNA)) currently used to identify chemicals with toxic potential. The LLNA, which is already a step towards refinement and reduction of the use of animals, is the preferred method for safety assessment as it provides a quantitative value (the concentration of the chemical which causes a threshold positive response (EC3)) that can determine the potency of the sensitizer. Already over a dozen different in vitro tests to identify sensitizers have been submitted to the European Union Reference Laboratory for Alternatives to Animal Testing (EURL ECVAM). Currently, two of these are validated for risk assessment. These are the direct peptide reactivity assay (DPRA), based on the chemical understanding and correlation with sensitization, and the human cell line activation test (h-CLAT), based on the activation of dendritic-like cells (Bauch et al., 2011; Sakaguchi et al., 2006; Ashikaga et al., 2010). The KeratinoSens™ luciferase-reporter gene model (based on the anti-oxidant response element in the HaCaT keratinocyte cell line) (Natsch, 2010; Andreas et al., 2011) has already been validated by Givaudan and is accepted by OECD. Integrated testing strategies (ITS) will be the way forward, as each assay on its own has 80% accuracy, but if combined in an ITS, 90% accuracy can be reached (Bauch et al., 2012). This level of predictivity would perform better than the validated LLNA. Thus, an ITS would fully replace the existing animal models.
Skin sensitization is a field in which several formally validated methods and ITS are expected to emerge in the near future. The reasons for this are the following: first, for skin sensitization the validation process has clear anchors: this is the only toxicological domain that is based on a formally validated animal test model (LLNA). Moreover, a large set of human data on positive control compounds is available, e.g., from diagnostic patch testing in dermatology clinics. Second, the mechanisms of skin sensitization are well understood, and the individual steps are amenable to modeling. Third, several in vitro models that seek to mimic each single step in the pathway are already available, and they now need only to be combined in an ITS. CAAT organized a workshop on ITS using the example of skin sensitization in June 2013 in Ranco, Italy; the respective report is currently being completed.
The application of the OECD-promoted concept of "adverse outcome pathways" (AOPs) to skin sensitization is relatively straightforward. Virtually all key events of the AOP already are covered by in vitro assays (Fig. 11)⁶. Despite this favorable situation, validation of a complete ITS for skin sensitization will require further work. It is, for instance, not yet clear how the individual tests that cover the steps of the AOP will be combined, including how much weight is given to the results of each assay and how the decision points of tiered testing would be structured. The final prediction model must be built as a whole on the assembly of tests and on the ITS rules linking them. The process of building and optimizing this overall test strategy is made difficult by the fact that the LLNA, even though it is one of the most advanced in vivo methods, can yield false-negative and false-positive results. Despite these weaknesses, and although human data are available as an alternative reference point, the LLNA is the only accepted reference for the determination of potency and for providing background data for ITS validation.
The conclusions on the status and roadmap for skin sensitization testing are as follows:
- Many non-animal methods for skin sensitization testing have been proposed and some of them have been/will be validated for the purpose of hazard identification. The development of non-animal methods for the evaluation of the relative skin sensitizing potency of contact allergens will require more work.
- Better measurements and tests for exposure are needed, and little is known about how to assess mixtures yet.
- Complications may arise when there is a need to test hydrophobic compounds or formulations as the proposed models may not be adequate. These problems must be tackled sooner rather than later and the applicability of each model should be assessed accordingly. This will provide opportunities for the development of other assays with other applicability domains.
- Many other areas of toxicology can follow skin sensitization as a good example where a detailed understanding of mechanisms can lead to the development of specific assays needed to identify compound toxicity.
- Computational models based on quantitative structure-activity relationships (QSARs) provide promising tools to identify sensitizers, as the toxicity of the chemical is implicit in its structure. There have been major advances in QSAR models, although studies use data from the LLNA rather than human data and have difficulties in obtaining accuracy in models for "moderate" sensitizers (Li et al., 2007).

Fig. 11
The general scheme of an AOP is illustrated in the upper panel. The AOP provides a mechanistic link between a chemical structure and the response of the organism to the chemical. At increasing levels of complexity, the xenobiotic's action is assumed to be started by a molecular initiating event, followed by cellular and organ responses that eventually explain the effect on the organism. The middle panel gives an example by depicting the events leading to skin sensitization. Understanding the underlying pathophysiology is necessary to create a set of in vitro models for all key events. The lower panel shows an example of a specific AOP for skin sensitization. Key event 1 corresponds to the molecular initiating event. Further key events are shown and each of them may be modeled in vitro. Combination of such in vitro tests in an integrated strategy (ITS) would allow comprehensive predictions for unknown xenobiotics.

Repeated dose toxicity
Repeated dose testing (RDT) consists of the evaluation of a chemical's potential to cause chronic toxicity and organ-specific toxicities. Classically, tests for RDT are based on 4 (sub-acute toxicity), 13 (sub-chronic toxicity) and 26-102 (chronic toxicity) week rodent and non-rodent studies. Toxicity occurs after a chemical is absorbed into the general circulation. There is great concern about the relevance of these studies performed in animals for predicting human toxicity (Basketter et al., 2012; Chen et al., 2014; Hengstler et al., 1999; Olson et al., 2000). Different organizations (FDA, EPA, EMEA) and initiatives (REACH, TSCA and the EU Cosmetic Directive) are pushing in vitro methods in the chemical toxicity evaluation process. RDT includes chronic adverse effects on major organs. On the one hand, the assessment of RDT requires lengthy in vivo experiments, which are difficult to model in vitro. On the other hand, inter-species differences can limit the usefulness of animal data for the prediction of human hazard in this area (Leist and Hartung, 2013). In vitro methods based on human cell lines may provide more human-relevant information (Pfaller et al., 2001). Biological models for different organs, e.g., liver, kidney, lung or brain, have been established, and new culture techniques, especially in form of 3D organoids, are expected to solve present issues about long-term culturing, absence of relevant inflammatory and immune cells (Hengstler et al., 2012) and availability of fully mature cell phenotypes. Stem cells, especially pluripotent stem cells, will be a major source of tissues and cells not available otherwise. Therefore, research on the generation of 2D cultures and 3D tissues from stem cells is of high importance. One of the approaches in this direction is the European SEURAT-1 project (following the long-term strategic target: "Safety Evaluation Ultimately Replacing Animal Testing", http://www.seurat-1.eu). It started in 2011 with 50 million € joint funding from the European Commission and Cosmetics Europe and is focusing on the development of non-animal test systems in the field of repeat dose systemic toxicity following a case study approach based on the AOP concept. The Tox21 consortium and the US EPA's ToxCast™ activity in the USA (Dix et al., 2007; Judson et al., 2010, 2014) as well as other activities in Europe and worldwide take similar or complementary approaches (NRC, 2007; Adler et al., 2011; Basketter et al., 2012; Judson et al., 2012; Leist et al., 2012b). Key to all these activities is the concept that most late (longer term) effects of chemicals will be predicted from the early changes they cause in cellular signaling and regulation (Blaauboer et al., 2012). Therefore, signaling pathway identification and analysis is a crucial research necessity in toxicology, and very detailed quantitative information needs to be derived (Fig. 12) to use such data for systems biology modeling (Jennings et al., 2013; Krug et al., 2014). Toxicogenomics technologies (Ramirez et al., 2013) are important tools that cover a multitude of cellular events. However, it is important to apply them to the right biological models. For instance, monocultures can hardly model the inflammatory responses frequently seen after long-term exposure to hazardous chemicals.
For repeated dose toxicity, two different approaches are taken in the development of alternative methods: (a) substitution of animals by a battery of relatively complex surrogate models that reflect important features of target tissues and organs. They often use 'apical' phenotypic endpoints (e.g., cell death markers) as readouts; (b) an integrated and tiered systems biology approach based on mechanistic endpoints and using the vast knowledge on biological regulation and homeostasis. Pathways-of-toxicity (PoT), emerging from such approaches, will guide hazard evaluation and risk assessment when combined with toxicokinetics modelling (Hartung and McBride, 2011; Boekelheide and Andersen, 2010). The two types of approaches may also be combined.
In summary, repeated dose toxicity testing will probably be the last method to be replaced. The use of PoT and new culture systems, combined with new technologies and sharing of data on pharmaceutical case studies, could be the opportunity to reduce the need for such expensive and long-term studies.

Toxicokinetics and quantitative in vitro-in vivo extrapolation (qIVIVE)
To relate data from non-animal test systems to the human situation, the in vitro concentration levels need to be correlated with the real exposure in vivo. Procedures for such extrapolations (qIVIVE) have been established (Fig. 13). The starting point is a determination of the "real" toxicant concentration that a cell is exposed to. This may be different from the nominal concentration, because of evaporation, metabolism, binding to plastic or uneven distribution in cells. Next, a physiologically-based pharmacokinetic (PBPK) model would be constructed for absorption and distribution in the whole organism, followed by metabolism and excretion. In vitro test systems to predict drug

Fig. 12: Illustration of the different deviations of signals (physiological cellular responses) that need to be measured by modern in vitro methods
The normal cellular response is shown in blue. This is meant to symbolize any cellular function, such as a muscle contraction, an electrical signal in neurons or the regulation of glucose. Red and white curves exemplify different toxic responses. The examples show that key parameters need to be measured at high temporal and spatial resolution and over many concentrations to be sure the whole range of toxicological reactions is covered. "Toxicity" is in many cases not a simple decrease or absence of a response, but too much or wrong timing can be equally problematic.

Fig. 13: Schematic explanation of quantitative in vitro-in vivo extrapolation (qIVIVE)
The qIVIVE procedure is considered a pivotal step in the use of in vitro data for the risk assessment process. In vitro toxicity assays provide a benchmark concentration (BMC), i.e., a concentration above which a chemical is considered to be toxic in this system. The BMC is used as the point of departure (POD) for further qIVIVE steps. It allows the calculation of the corresponding human plasma concentration (PC). By taking into account in vitro data on metabolic conversion, human physiology and metabolic parameters, the human equivalent dose can be estimated. This is the starting point of the risk assessment process.
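The chain described in this caption (BMC as point of departure, corresponding plasma concentration, human equivalent dose) can be illustrated with a minimal sketch. It is not the procedure of the cited authors: it swaps the full PBPK model for a simple steady-state clearance model (well-stirred liver plus renal filtration), assumes complete oral absorption, and assumes that the free concentration in vitro should match the free concentration in plasma. All names and parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HumanParameters:
    """Illustrative human physiology and chemical-specific values (all hypothetical)."""
    body_weight_kg: float = 70.0
    liver_blood_flow_l_per_h: float = 90.0
    glomerular_filtration_l_per_h: float = 6.7
    fraction_unbound_plasma: float = 0.1
    # Assumed to be scaled up from in vitro metabolic conversion data.
    hepatic_intrinsic_clearance_l_per_h: float = 50.0


def free_in_vitro_concentration(nominal_um: float, fraction_free: float) -> float:
    """Correct a nominal concentration for losses such as evaporation or plastic binding."""
    return nominal_um * fraction_free


def total_clearance_l_per_h(p: HumanParameters) -> float:
    """Well-stirred liver model plus renal filtration of the unbound chemical."""
    fu_clint = p.fraction_unbound_plasma * p.hepatic_intrinsic_clearance_l_per_h
    hepatic = p.liver_blood_flow_l_per_h * fu_clint / (p.liver_blood_flow_l_per_h + fu_clint)
    renal = p.glomerular_filtration_l_per_h * p.fraction_unbound_plasma
    return hepatic + renal


def plasma_css_um_per_unit_dose(p: HumanParameters, molecular_weight_g_per_mol: float) -> float:
    """Steady-state total plasma concentration (µM) for a dose of 1 mg/kg/day."""
    dose_mg_per_h = 1.0 * p.body_weight_kg / 24.0
    css_mg_per_l = dose_mg_per_h / total_clearance_l_per_h(p)
    # mg/L divided by g/mol gives mmol/L; multiply by 1000 for µmol/L.
    return css_mg_per_l / molecular_weight_g_per_mol * 1000.0


def human_equivalent_dose_mg_per_kg_day(
    bmc_nominal_um: float,
    fraction_free_in_vitro: float,
    p: HumanParameters,
    molecular_weight_g_per_mol: float,
) -> float:
    """Reverse dosimetry: BMC (point of departure) -> plasma concentration -> dose."""
    free_bmc_um = free_in_vitro_concentration(bmc_nominal_um, fraction_free_in_vitro)
    # Total plasma concentration whose free fraction equals the free BMC.
    target_plasma_um = free_bmc_um / p.fraction_unbound_plasma
    return target_plasma_um / plasma_css_um_per_unit_dose(p, molecular_weight_g_per_mol)


if __name__ == "__main__":
    params = HumanParameters()
    hed = human_equivalent_dose_mg_per_kg_day(10.0, 0.5, params, 200.0)
    print(f"Estimated human equivalent dose: {hed:.2f} mg/kg/day")
```

Because the steady-state model is linear, the estimated dose scales proportionally with the BMC. A full PBPK model would be needed to capture time-dependent or saturable kinetics.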
metabolism or certain distribution parameters can provide data for such modeling (Vinci et al., 2012; Gebhardt et al., 2003), but better assays are still required for local specialized metabolism, distribution mediated by transporters, and for excretion processes (e.g., in the kidney). Altogether, this area is far advanced, e.g., for drug development, but its general application for chemicals requires further development (Bessems et al., 2014). Detailed case studies are required to explore the performance of currently available methods (Fig. 14).

Reproductive toxicity
Reproductive toxicology, including developmental toxicology, is a particularly difficult field as far as animal-to-human predictions are concerned (Knudsen et al., 2011; Makris et al., 2011). Reproductive toxicity testing aims to assess possible hazards to the reproductive cycle, with particular interest in the early stages of embryonic development (embryotoxicity). Tests like the two-generation study are among the most costly and require up to 3,200 animals per substance (Hartung, 2008; Rovida and Hartung, 2009). This makes it impossible to test the enormous number of chemicals present on the market, leading to a lack of information on the reproductive and developmental toxicity of tens of thousands of chemicals. Moreover, animal-based tests offer little mechanistic insight into a chemical's toxic mode-of-action (MoA) (Knudsen, 2013). Animal reproductive toxicity testing has shown high background variability (even among untreated control animals) and is characterized by low species concordance (ToxCast™, for example, showed 60% concordance between rat and rabbit studies, and 56% concordance between zebrafish and rat). In some cases, to overcome low sensitivity, studies in a second species may be requested by regulators. However, the two-species approach increases both the cost of the studies and the false-positive rates dramatically (Hartung, 2009). For this reason, in 2009, a revision of the REACH legislation reduced the use of a second species.
Further progress in this area would be accelerated by regulatory steps that preclude the use of in vivo data unless they come from a formally validated model and therefore have a known predictivity (Carney et al., 2011). Uses of the zebrafish assay (Selderslaghs et al., 2011; Padilla et al., 2012; Truong et al., 2014), the embryonic stem cell test (EST) (van Dartel and Piersma, 2011; Seiler and Spielmann, 2011) and further developments on the basis of ReProTect test systems (Piersma, 2010) could immediately fill the gap until assays based on human cells become available. The field is developing very dynamically and, especially in the area of developmental neurotoxicity, many new test systems are emerging (Zimmer et al., 2014; Smirnova et al., 2014; Bal-Price et al., 2012). The new assays will need to be assembled into an advanced test battery using concepts of ITS design (Fig. 15).
The ReProTect project assembled 35 European partners from academia, SMEs (small and medium-sized enterprises) and governmental institutes in order to develop in vitro reproductive toxicity approaches (http://www.reprotect.eu/). The scientific problem of identifying non-animal test methods in this field was addressed (Hareng et al., 2005). The project was based on a battery of in vitro methods that covered different steps of the reproductive cycle (Fig. 16). In a so-called "feasibility study" conducted at the end of the project, 10 blinded chemicals were tested by the consortium. Effects on three endpoints, namely male fertility, female fertility and embryotoxicity, were predicted. The results of the feasibility study demonstrated that the vast majority of the predictions made were correct (Schenk et al., 2010).
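The trade-off behind the two-species approach in reproductive toxicity testing (a second species is requested to raise sensitivity, but false-positive rates rise with it) can be made concrete with a small calculation. The sketch below assumes that the studies in each species are statistically independent, that they share the same sensitivity and specificity, and that a positive result in any species classifies the substance as positive. These assumptions and the example values are illustrative only and are not taken from the cited studies.

```python
def combined_false_positive_rate(specificity: float, n_species: int) -> float:
    """Chance that a truly non-toxic substance is positive in at least one species."""
    _check(specificity, n_species)
    return 1.0 - specificity ** n_species


def combined_sensitivity(sensitivity: float, n_species: int) -> float:
    """Chance that a truly toxic substance is positive in at least one species."""
    _check(sensitivity, n_species)
    return 1.0 - (1.0 - sensitivity) ** n_species


def _check(rate: float, n_species: int) -> None:
    # Guard against values that are not valid probabilities or study counts.
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie between 0 and 1")
    if n_species < 1:
        raise ValueError("at least one species is required")


if __name__ == "__main__":
    # Hypothetical per-species performance.
    sensitivity, specificity = 0.6, 0.8
    for n in (1, 2):
        print(
            f"{n} species: sensitivity {combined_sensitivity(sensitivity, n):.2f}, "
            f"false-positive rate {combined_false_positive_rate(specificity, n):.2f}"
        )
```

Under these assumptions, adding a second species raises sensitivity from 0.60 to 0.84 but also raises the false-positive rate from 0.20 to 0.36. Correlated outcomes between species would reduce both effects.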

Fig. 15: The roadmap for animal-free reproductive toxicity predictions
In the area of reproductive toxicity the experts suggested, in addition to the points summarized in Figures 4 and 6, to include several specific measures and research lines to be followed.

Fig. 14: Roadmap to animal-free toxicokinetic predictions
The experts in the area of toxicokinetics identified research areas requiring further work to obtain human-relevant toxicokinetic data on xenobiotics independent of animal experiments.

In the follow-up European project ChemScreen (http://www.chemscreen.eu), 12 chemicals were tested for embryotoxicity in a final performance test. The battery correctly detected 11 out of 12 compounds tested. The consortium concluded that "this study illustrates added value of combining assays that contain complementary biological processes and mechanisms, increasing predictive value of the battery over individual assays" (Piersma et al., 2013).
In silico models (e.g., the US EPA's Virtual Embryo project) have also shown potential application in the reproductive toxicology field. It is expected that ultimately a computer model that simulates cellular function in the growing embryo can be used to determine the effects of teratogens. Some promising first results come from an in silico modeling platform: a novel multi-cellular agent-based model (ABM) of vasculogenesis, using the CompuCell3D (http://www.compucell3d.org/) modeling environment supplemented with semi-automatic knowledgebase creation, has been developed by EPA. Dynamic cell ABMs have been shown to simulate complex developing systems and, consequently, display a potential to simulate adverse effects (Kleinstreuer et al., 2013; Hester et al., 2011; Shirinifard et al., 2013) and aberrant tissue fusion (Ray and Niswander, 2012).

Fig. 16: Examples of a test battery addressing a highly complex toxicological endpoint
The reproductive cycle with its four main phases is the target of reproductive toxicants. The FP6 EU project ReProTect established an in vitro test battery for reproductive toxicity testing covering the reproductive cycle with a series of individual tests. Each test system covers a small part of the reproductive cycle. The names of the different tests are depicted outside and inside the circle, indicating which part of the developmental process is modeled. For full explanation see http://axlr8.eu/axlr8-2010-progress-report.pdf

Carcinogenicity
At present, the carcinogenicity hazard of chemicals is determined by a costly and lengthy animal test, the "cancer bioassay", although its relevance for human health is seriously doubted (Alden et al., 1996; Knight, 2007; Gottmann et al., 2001). Results of more than 3,500 cancer bioassays, which cost about €800,000 per substance and species, are publicly available: 53% of all substances tested were positive, suggesting an enormous false-positive rate, but still some accepted human carcinogens were not positive (Basketter et al., 2012). The correlation of findings in rats and mice is less than 60%, and even lower if the site of cancer in the organism is considered. The experts suggested a thorough evaluation of the test, taking into account the principles of evidence-based toxicology (Hoffmann and Hartung, 2006b). This might lead to an abolition of the in vivo assay (Fig. 17). Chemical carcinogenicity may be based either on genotoxic or non-genotoxic (epigenetic) mechanisms (Oliveira et al., 2007). Alternative methods for the determination of genotoxicity have been in use for over 40 years. An ITS has been suggested to combine such available methods (Pfuhler et al., 2010; Aldenberg and Jaworska, 2010). Testing for non-genotoxic carcinogens has proven more difficult, but good results have been obtained recently by various cell transformation assays (Vanparys et al., 2011). A combination of mutagenesis assays, tests for DNA damage, cell transformation assays and targeted tests for frequent epigenetic mechanisms (e.g., nuclear receptor activation) will most likely form the basis for a future ITS. Most elements are available in some form, but they will require further development and optimization for satisfactory predictivity (Fig. 17).

Fig. 17: Roadmap for animal-free carcinogenicity predictions
In the area of carcinogenicity the experts suggested, in addition to the points summarized in Figures 4 and 6, to include several specific measures and research lines to be followed.
7 Evaluation of test system performance
Evaluation of test system performance has classically considered three aspects (Fig. 18): (1) the technical reliability of the test; (2) the scientific background, rationale and scope; (3) the correlation of test data with a gold standard (e.g., animal data). The latter point has also been called predictivity (Hartung et al., 2004; Hoffmann and Hartung, 2006a; Moore et al., 2009). The validation procedure has until now followed very strict and rigid rules in the field of chemical testing. This has led to high costs and long delays before new assays were introduced. Moreover, the definition of assay predictivity on the basis of animal data has proven to be problematic because of the shortcomings of the in vivo experiments. Therefore, new validation concepts have to be considered. For instance, high throughput screening assays need to be treated differently from other tests as, e.g., ring trials cannot be performed when certain robotics equipment is available only in one place (Judson et al., 2013). In cases where predictivity cannot be determined from correlation studies, entirely different approaches need to be considered. Mechanistic validation, i.e., a focus on the scientific background and consistency of an assay (Hartung et al., 2013a; Leist et al., 2012a), is the most promising general option. For specialized assays, e.g., within a test battery or as the basis of a screening approach, the test performance will also need to be judged according to its fit for the specific purpose (Fig. 18). The concept of flexible evaluations of test system performance has been pioneered in the field of drug discovery. Fit-for-purpose evaluation and rating of test predictivity based on the scientific rationale of a test are commonplace in this field, and this experience could help to establish faster and more efficient assay evaluation procedures for chemicals, cosmetics and pesticides. In the discussion on test system predictivity it is important to keep in mind that test reliability will always be an absolutely mandatory condition, whatever evaluation process is chosen. The progress and acceptance of non-animal based testing will depend on this criterion and its strict implementation in the field (Leist et al., 2010).

Fig. 18: New validation approach for novel toxicity tests
At present, test validation relies on three pillars: reliability, scientific basis and predictivity. Predictivity has in practical terms been determined by the correlation of in vitro test results with animal data. This approach is not possible for many of the toxicity domains discussed here, and many of the assays being developed are part of a test battery. Future validation must therefore rely on two pillars. First, even more focus is required on test quality (reliability). Second, the scientific basis of a test needs to be broadened to provide a rationale for the predictive capacity of the test, based not on statistical correlation but on scientific (mechanistic) explanations.

Fig. 19: Vision for the future of toxicity testing
The current approach is first to test unknown chemicals in animal tests. This limits overall throughput and leads to a high rate of false positives and false negatives. Mixtures are hardly ever tested because of the limitation of resources. High costs in combination with a low predictivity lead to many cases of "no testing". Mechanistic studies are only carried out in few cases of particular interest to identify the factors causing the toxicity. The new approach, suggested here, is based on 21st century in silico and in vitro methods identifying PoT. This will in most cases lead to an amount of data that is sufficient to decide whether a substance is toxic (positive) or non-toxic (negative) for the intended scenario. Only in few cases, when not enough information can be obtained, will animal tests be performed as an additional source of information. Good information can be provided on all chemicals and, due to the high throughput of the approach, also on mixtures.
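For the classical notion of predictivity discussed in the section on test system performance (correlation of test results with a reference standard), the usual summary figures can be computed from a two-by-two comparison table. The sketch below is a generic illustration, not a validation procedure prescribed by the report. The class and function names and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonCounts:
    """Counts of test results compared against a reference standard."""
    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int


def predictivity_summary(c: ComparisonCounts) -> dict[str, float]:
    """Return sensitivity, specificity and overall concordance as fractions."""
    reference_positives = c.true_positives + c.false_negatives
    reference_negatives = c.true_negatives + c.false_positives
    if reference_positives == 0 or reference_negatives == 0:
        raise ValueError("reference set needs both positive and negative substances")
    total = reference_positives + reference_negatives
    return {
        "sensitivity": c.true_positives / reference_positives,
        "specificity": c.true_negatives / reference_negatives,
        "concordance": (c.true_positives + c.true_negatives) / total,
    }


if __name__ == "__main__":
    # Hypothetical evaluation of 40 reference substances.
    counts = ComparisonCounts(
        true_positives=18, false_positives=4, true_negatives=16, false_negatives=2
    )
    for name, value in predictivity_summary(counts).items():
        print(f"{name}: {value:.2f}")
```

These figures are only as meaningful as the reference standard itself, which is why the text argues for complementing or replacing correlation-based predictivity with mechanistic validation.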

The future of toxicity testing
The participants of the roadmap consensus symposium envisaged that two key features will distinguish future toxicity testing from the present (Fig. 19). First, the present animal-based testing (sometimes followed by in vitro tests to supply mechanistic information) will be substituted by ITS using in vitro and in silico approaches (sometimes followed by animal tests, where further data are needed). Second, according to the vision for a new toxicology, data will be generated for every chemical and possibly also for important mixtures. This contrasts with the present situation, in which hardly any data are available on many chemicals (Crofton et al., 2012) and many tests are being waived. Moreover, a large part of the available data lacks mechanistic background and consistency controls. It therefore cannot be used to supply information concerning adverse outcome pathways. Despite the many shortcomings and, in particular, a lack of formal validation, animal data are still being used as the gold standard. This contributes to an underestimation of the success of non-animal methods. The expectations regarding scientific validity and predictivity are usually higher for alternative methods than for the respective in vivo models. This is a key issue for the roadmap initiative, as the primary goal is to provide methods that are as good as animal models (not better). This goal might already have been reached in some areas. The future goal would then be to further improve the quality of safety testing beyond that of animal experiments.
As already detailed in the NRC report on the vision for toxicity testing in the 21st century (NRC, 2007; Leist et al., 2008a), the radically new approach to risk assessment has large economic and scientific advantages over the present animal-based system. The roadmap outlined in this overview highlights important steps towards this goal. All experts agreed that considerable work is still needed, but there was also strong consensus that an impressive advance has already been achieved and that the goal is well worth the required efforts.