Developing Context Appropriate Toxicity Testing Approaches Using New Alternative Methods (NAMs)

In the past 10 years, the public, private, and non-profit sectors have found agreement that hazard identification and risk assessment should capitalize on the explosion of knowledge in the biological sciences, moving away from in life animal testing toward more human-relevant in vitro and in silico methods, collectively referred to as new approach methodologies (NAMs). The goals for implementation of NAMs are to efficiently identify possible chemical hazards and to gather dose-response data to inform more human-relevant safety assessment. While work proceeds to develop NAMs, there has been less emphasis on creating decision criteria or showing how risk context should guide selection and use of NAMs. Here, we outline application scenarios for NAMs in different risk contexts and place different NAMs and conventional testing approaches into four broad levels. Level 1 relies solely on computational screening; Level 2 consists of high throughput in vitro screening with human cells intended to provide broad coverage of possible responses; Level 3 focuses on fit-for-purpose assays selected based on presumptive modes of action (MOA) and designed to provide more quantitative estimates of relevant dose responses; Level 4 has a variety of more complex multi-dimensional or multi-cellular assays and might include targeted in vivo studies to further define MOA. Each level also includes decision-appropriate exposure assessment tools. Our aims here are to (1) foster discussion about context-dependent applications of NAMs in relation to risk assessment needs and (2) describe a functional roadmap to identify where NAMs are expected to be adequate for chemical safety decision-making. This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is appropriately cited. 
1 https://www.ifado.de/toxicology/2015/12/04/seurat-1-painting-the-future-animal-free-safety-assessment-of-chemical-substances/ 2 https://www.eu-toxrisk.eu/page/en/about-eu-toxrisk.php


Introduction
The National Academy of Sciences (NAS) report in 2007, "Toxicity Testing in the 21st Century: A Vision and A Strategy," proposed fundamental changes in chemical risk assessment, including moving to human cells, tissues, or cell lines, developing high-throughput methods for evaluating large numbers of chemicals more efficiently, and using various computational chemistry and bioinformatic tools for data analysis and prediction of risk (NRC, 2007). The National Center for Computational Toxicology (NCCT) at the US EPA had previously developed a plan to incorporate many of these approaches in toxicity testing, as described in "A Framework for a Computational Toxicology Research Strategy" (Kavlock et al., 2003) and after publication of ...
... to maintain a list of acceptable NAMs. More recently, a US EPA memorandum published in September 2019 officially announced a commitment to reduce its requests for and funding of live mammal studies by 30% by 2025 and to eliminate all live mammal study requests by 2035 (Grimm, 2019).
The primary testing initiatives following the release of the 2007 NAS report focused on screening large numbers of compounds with existing high-throughput assays (e.g., ToxCast, Tox21), many of which were repurposed from pharmaceutical applications (Judson et al., 2010; Reif et al., 2010). These efforts developed the infrastructure necessary for collection and analysis of large-scale data and determining the utility of existing methods for supporting chemical safety decisions. Now, with lessons learned about the practicalities of high-throughput screening, together with significant advancements in data science and in vitro technologies, the toxicology and risk assessment communities are better equipped to approach the realities of NAM-based risk assessments. New assays examining a broader palette of possible responses (e.g., "omic" technologies and high-content imaging) are now being discussed, and tiered approaches are being developed for the use of these test platforms to streamline toxicity testing (Thomas et al., 2013, 2019).
One aspect emphasized in the 2007 NAS Toxicity Testing report was that collecting data on hazard should be tied to risk decision contexts. The original figure (S-1) from the report (NRC, 2007) describing components of the vision had three parts: Chemical Characterization, Toxicity Testing (including both toxicity pathway evaluations and targeted testing) and Dose-Response and Extrapolation Modeling. These three were surrounded by a circle identified as "Risk Contexts" and "Population and Exposure Data." In 2009, the NAS produced another report on opportunities in exposure science (NRC, 2009). The linkage between assessing toxicity and biological activity with NAMs and the use of higher-throughput methods for exposure assessment provides the basis for developing testing approaches more suited to answering diverse risk assessment questions.
Fig. 1: A multi-level strategy for using new alternative methods and higher-throughput exposure tools for context dependent safety assessments
This tiered testing strategy is problem formulation-driven. The level of information available about chemicals will guide the particular testing required for any use conditions. The progression through different levels (orange arrows) is governed by decision context. Depending on the margin-of-exposure (MOE) estimated at each level, a decision-maker might still regard the information at any specific level to be insufficient, leading to consideration of higher-level testing to refine the analysis and have a greater degree of confidence in any decision. More detail on the NAMs at each level is in the text.
Inferences about likely risks or safe usage of compounds at each of these levels underpin decisions regarding safety for specific use conditions or the need for further testing. A variety of considerations, such as the magnitude of the MOE and the accuracy, regulatory acceptance and biological coverage of the assays populating the level, would have to be considered in deciding if higher-level testing would be necessary.

Looking at decisions at each level
The levels are characterized, in part, by the investment in information generated by each level and by confidence in the results. Gaining confidence in their use will be essential to make NAMs acceptable to regulatory agencies. The US EPA strategic plan based on the TSCA has three important components: (1) identifying, developing and integrating NAMs for TSCA decisions; (2) building confidence that the NAMs are scientifically reliable and relevant for TSCA decisions; and (3) implementing the reliable and relevant NAMs for TSCA decisions. Approaches to establish confidence (validation) will need to be developed. Issues of confidence affect NAMs at all four of our testing levels.
Outputs of testing and exposure assessment at each of these levels are potentially suitable for informing different decisions. In "Level 1 - Computational screening," high-throughput predictions of exposure, putative toxicity, expected metabolites, etc., are obtained using computational methods and would support chemical categorization and decision-making in limited contexts. For instance, when choosing among several lead chemicals for any particular application, compounds predicted to have higher exposures or carrying possible toxicity liabilities as determined by computational methods could be dropped from further consideration. NAMs populating this level are purely computational and support prioritization for further study or decisions that the chemical is unsuitable for intended applications (Thomas et al., 2013).
With indications of higher expected exposures or indications of possible toxicity at Level 1, testing would be completed and compounds triaged or deprioritized. For some compounds already in commerce or those moving forward in development due to favorable use characteristics, further work would be required to refine exposure potential and determine activity in "Level 2 - High throughput (HT) in vitro screening." This level would be populated by rapid, high-throughput dose-response screening of compound bioactivity and high-throughput in vitro-in vivo extrapolation (HT-IVIVE) (Yoon et al., 2015). HT-IVIVE converts active concentrations from an in vitro assay (e.g., Tox21, ToxCast) to a human equivalent dose, i.e., to a human dose or human exposure that would be expected to produce concentrations in an exposed person equal to the active concentration from the in vitro study (Rotroff et al., 2010; Sipes et al., 2017; Casey et al., 2018; Wambaugh et al., 2018). The NAMs populating Level 2 would optimally provide quantitative measures of response.

NAMs and risk-based decisions
Converting computational approaches and in vitro test results to expected potency of test compounds for specific in life responses is more complex than with traditional animal tests. The more straightforward uses of these tests are to provide indications of expected in life responses based on chemical properties or in vitro "hits" or to predict expected exposures. For these reasons, most recommendations for early implementation of NAMs focused on prioritization: identifying chemicals with higher potential for toxicity for more in-depth evaluation or removing chemicals entirely from further consideration. Higher priority compounds, those with some perceived hazard for specific types of adverse responses or with higher exposure potential, might be escalated to additional testing or, in some cases, to traditional animal-based methods depending on the decision context. Conversely, materials with higher perceived risks might simply be dropped from further consideration for development or removed from commercial use. In developing more explicit prioritization schemes, it is important to include criteria that support decisions on lower priority compounds, so that they are not simply set aside awaiting extensive testing once the higher priority compounds move through more comprehensive testing of toxicity pathways or on to in life testing. For instance, after identifying and testing the high priority compounds, what is the strategy for moving on to those defined as lower priority? Our goal in developing these different test levels has been to minimize the need for higher tier testing through adherence to a risk context-based implementation of NAMs. Figure 1 depicts four levels of testing that focus on use of NAMs (including both assays of biological activity and higher-throughput exposure methods) to develop information sufficient for decision-making. Each level has NAMs for assessing both biological activity and expected human exposure.
Relative safety of product usage is estimated by calculating margins-of-exposure (MOEs), a ratio of a measure of expected potency divided by a measure of expected exposure in a population. This paper considers the question of when information available from any one of these levels would be considered sufficient for risk-based decisions and the kinds of decisions possible at the various levels. Light blue boxes to the left describe level-appropriate approaches for assessing bioactivity and for estimating expected exposure. The information provided by NAMs at each level allows calculation of an MOE (orange boxes to the right).
The principles and approaches discussed here also apply to ecological risk evaluations, recognizing that the assays, IVIVE tools and exposure evaluations would need to be tailored to support ecological risk assessment decision contexts.
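As a rough illustration of the level-by-level logic described above, the following Python sketch computes an MOE at each level and reports the lowest level at which the MOE meets a context-specific requirement. The class and function names, the numerical values, and the required MOEs are hypothetical assumptions made only for illustration. In practice, sufficiency would also depend on considerations such as assay accuracy, regulatory acceptance and biological coverage, not on the magnitude of the MOE alone.

```python
from dataclasses import dataclass


@dataclass
class LevelResult:
    level: int       # testing level, 1 to 4, as in Figure 1
    potency: float   # measure of expected potency (e.g., mg/kg/day)
    exposure: float  # measure of expected exposure (same units)


def margin_of_exposure(potency, exposure):
    """MOE = measure of expected potency / measure of expected exposure."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return potency / exposure


def first_sufficient_level(results, required_moe):
    """Return (level, MOE) for the lowest level whose MOE meets the
    context-specific requirement, or None if every level falls short."""
    for r in sorted(results, key=lambda r: r.level):
        moe = margin_of_exposure(r.potency, r.exposure)
        if moe >= required_moe[r.level]:
            return r.level, moe
    return None


# Hypothetical values: Level 1 gives a small MOE, Level 2 a larger one.
results = [
    LevelResult(level=1, potency=1.0, exposure=0.5),     # MOE = 2
    LevelResult(level=2, potency=500.0, exposure=0.25),  # MOE = 2000
]
# Hypothetical, decision-context-specific MOE requirements per level.
required_moe = {1: 100.0, 2: 1000.0, 3: 100.0, 4: 100.0}
decision = first_sufficient_level(results, required_moe)
print(decision)  # (2, 2000.0)
```

Keeping the required MOE as an input, rather than a fixed constant, reflects the point that the decision context determines how much information is enough.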
Our context-dependent testing approach is applicable to a wide chemical space. However, the structure and physicochemical properties of compounds may be challenging in short-term NAMs and will determine which assays are possible and how testing should proceed. For example, modeling or prediction of compounds that are highly lipophilic and hence slowly eliminated from the body is extremely difficult. These compounds are also difficult to keep in solution for testing, as they adsorb on surfaces or form micelles. Most of the training sets for in silico tools were developed based on pharmaceuticals with a narrow range of physicochemical/metabolic properties, not chemicals with much broader physicochemical properties like the ones in ToxCast (Moreau et al., in preparation). Measuring metabolic rates of slowly cleared compounds is also challenging, and better approaches need to be developed with an eye toward defining a chemical space for which existing computational tools can predict metabolism with reasonable accuracy and an understanding about when these tools are inadequate for low-tier decision-making (Moreau et al., in preparation). Nonetheless, the testing and exposure assessment across these four levels should work with most compounds.

Level 1: Computational screening
We can next ask what NAMs might be involved at each of these levels. The optimal suite of computational tools in Level 1 should estimate physical properties, infer possible toxicity (e.g., QSAR platforms, threshold of toxicological concern, etc.), predict or take into account likely metabolites, and assess exposures that would be expected to arise during anticipated use conditions. The goal in Level 1 is to have computational tools that are developed with as large a range of compounds as possible in order to have confidence when calculating similar properties for a new compound or new class of compounds. The need for breadth of coverage is a challenge for model developers when data are not available to create models for specific endpoints or when data covering the domain of structural applicability is sparse. Even when data are available, they are often in need of curation, which is costly and time consuming.
Currently a variety of tools are available for estimating physicochemical properties and environmental fate endpoints: e.g., EPA's EPI Suite and ECOSAR (USEPA, 2018a), for predicting toxicological endpoints: TIMES (Mekenyan et al., 2004) and Leadscope (Roberts et al., 2000), for metabolites likely formed in vivo: (Leonard et al., 2018), Meteor Nexus (Marchant et al., 2008), BioTransformer (Djoumbou-Feunang et al., 2019), and ADMET Predictor®, and for both thresholds of toxicological concern (TTC) and possible exposure levels: (Patlewicz et al., 2018). With the collaborative estrogen receptor activity prediction project (CERAPP), large-scale modeling using 32,464 structures showed the possibility of screening large libraries of chemicals using a consensus of different in silico approaches (Mansouri et al., 2016). This approach has also been used to identify androgen active chemicals (Manganelli et al., 2019).
Quantitative measures of response from Level 2 assays, such as AC50 (a concentration causing 50% of maximal change in the assay results) or LEC (the lowest effective concentration), permit estimation of human equivalent doses (HEDs) based on active concentrations in these assays (Wetmore, 2015). The MOE would be the ratio of a measure of the HED divided by expected exposure levels. A decision-maker would have more confidence in the MOEs arising from these studies than the comparisons of estimated exposure and putative risks predicted from Level 1. Nonetheless, it would be difficult to be entirely comfortable making risk assessment decisions at Level 2 for compounds for which the estimated MOE was not sufficiently large; where the presumed MOAs inferred from these assays (e.g., reproductive, developmental, carcinogenic potential, etc.) increased the level of concern; or where high exposures were expected in a potential target population.
The testing at Level 2 with read-outs from multiple HT in vitro assays or from limited broad-coverage assays should be designed to provide information on MOAs. Based on presumptions of MOAs from these assays, "Level 3 - Fit-for-purpose assays and safety assessment" would apply human-relevant fit-for-purpose (FFP) assays to provide more in-depth examinations of MOA-related cellular perturbations in cell systems. FFP assays would be designed with read-outs that represent key signaling processes for cellular pathways associated with a chemical's MOA and ideally include markers that correlate with or directly measure adversity. Optimally, dose-response data from the FFP assays, together with computational pathway models (Bhattacharya et al., 2011), could provide a mechanistic understanding of the shape of the dose-response curve and support more informed extrapolation to relevant human exposure. Quantitative IVIVE (QIVIVE), accounting for human relevant metabolism coupled with dose-response relationships from the FFP assays, would provide more confidence in estimated MOEs and determination of regions of safety, i.e., exposure concentrations at which no increased risk is expected in a human. Depending on the MOE obtained with these more comprehensive FFP assays and better knowledge of use-specific exposures, a decision-maker might still deem this information to be insufficient, leading to consideration of more complex assays and more detailed compound-specific exposure information at "Level 4 - More intact systems." The opportunity to accumulate relevant and context-dependent information at each level should substantially reduce the number of chemicals tested in these more complex NAMs. And, when necessary, alternative multi-dimensional and multi-cellular assays would assess human tissue-based dose responses rather than moving to studies in animal models.
In addition, MOA information provided from HT testing and FFP assays could support more limited testing in animal models targeted to the specific MOA. It bears emphasis that, depending on the decision context (e.g., lead candidate selection, prioritization for remediation, ranking liabilities of possible substitutes, estimating risks with compounds lacking signals for endpoints of high regulatory concern, and formal regulatory decision-making), compounds would not have to be tested sequentially at all four levels in a tiered approach. The emphasis here is on NAMs for human health risk-based decision-making.
Activity in Level 2 assays and/or similarity to chemicals with known hazard profiles can be used to justify a finding that further study is or is not warranted. Furthermore, in our conception of Level 2, ADME (absorption, distribution, metabolism, elimination) data would also be integrated with bioactivity assay data to convert AC50 or LEC values to HEDs using in vitro to in vivo extrapolation methods and compared with exposure estimates to generate an MOE. An important challenge in interpreting responses in Level 2 assays relates to distinguishing biologically relevant pathway responses from the "burst effect" that can arise from substances that lack specific affinity for cellular pathways and that, at relatively high concentrations, elicit broad low-affinity non-covalent interactions, trigger cell stress pathways, or cause physical disruption of proteins or membranes (Judson et al., 2016; Shah et al., 2016).
There are now wider discussions about using cell-based assays designed to broadly examine gene expression using high-throughput transcriptomic analysis (Grimm et al., 2016; McMullen et al., 2019) and to assess cellular morphology using high-content imaging (HCI) (Vantangoli et al., 2016). Gene expression platforms such as BioSpyder have reduced the cost for whole genome differential gene expression analysis (DGEA) (Yeakley et al., 2017). Benchmark dose analysis (Thomas et al., 2007) and pathway visualization methods for MOA analysis, such as mode of action visualization software MoAviz (Andersen et al., 2018; in press), assess both potency and biological functions/pathways affected by treatment. HCI platforms can be automated (Feng et al., 2009; Bray et al., 2016, 2017) to query a wide range of cellular phenotypes, and linking the two assay platforms DGEA and HCI could provide the necessary link between transcriptomic signatures, cellular phenotype and MOAs. The DGEA and HCI can be regarded as pathway-agnostic methods; analysis of results of the assays gives an indication of MOA rather than using inferences from Level 1 to design more MOA-targeted assays. While there is significant enthusiasm for broad coverage assays that are not directly based on knowledge of MOAs, there is as yet no consensus on cell types or duration of exposure for these transcriptomic studies. To the extent that these second-generation assays (Thomas et al., 2019) are successful and their use to assess affected biological pathways becomes more widespread, the developed information can be merged with available databases, including those from ToxCast, and be used to develop DGE-signatures for the well-studied compounds in ToxCast Phase I and Phase II. The results of broad coverage, pathway-agnostic assays in Level 2 should ideally allow for (a) identification of AOPs/MOAs activated by a test substance and (b) IVIVE approaches to convert the active concentration to an HED.
Through this process, confidence in subsequent MOE calculations will increase and some Level 2 results may be accepted for decision-making, partially due to narrowing uncertainty regarding the MOE and knowledge of likely MOAs.
The computational models that apply algorithms to estimate TTCs and exposures permit estimation of approximate margins of safety (MOS; Nicolas et al., in preparation). These TTCs are derived from in vivo toxicity datasets and include a 100-fold safety factor. Due to the use of the 100-fold safety factor, MOS values, i.e., the TTC divided by expected exposure, are more conservative than MOEs derived using a HED. Similar approaches have been applied to the large CERAPP dataset to calculate both TTCs and exposures, thereby providing approximate estimates of MOEs. Other tools at Level 1 include in silico metabolite identification (met-ID) using tools such as the OECD QSAR Toolbox metabolism profiler or ACD/Labs MetaSense biotransformation map software. The results of computational scrutiny of a compound or group of compounds could provide compelling results, i.e., very low predicted exposures or lack of signals for expected toxic liabilities, leading either to a much-reduced level of concern or exemption of the chemical(s) from further study.
Some combination of higher expected exposures, indications of specific types of toxicity from QSAR methods, or physicochemical properties that indicate long half-lives in a target species or the environment, would raise flags, indicating a higher priority for considering further testing or, in the case of new compound development, possibly a decision to discontinue further development. In cases where Level 1 analyses fail to provide a sufficiently large MOS, the next step would be to use Level 2 NAMs to test for biological responses in high-throughput cellular or subcellular assay platforms, first considering NAMs that target possible toxicity identified in Level 1.
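The Level 1 screening step described above, in which the MOS is the TTC divided by expected exposure and compounds without a sufficiently large MOS move on to Level 2, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the chemical names, the TTC and exposure values, and the MOS cut-off are all hypothetical, and a real triage would also weigh QSAR alerts and physicochemical flags such as long half-lives.

```python
def margin_of_safety(ttc, exposure):
    """MOS = TTC / expected exposure (both in the same units,
    e.g., ug/kg/day)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return ttc / exposure


def triage_level1(chemicals, min_mos):
    """Split chemicals into those with a sufficiently large MOS and those
    flagged for Level 2 testing.

    `chemicals` maps a name to a (ttc, exposure) pair; `min_mos` is a
    hypothetical, context-specific cut-off.
    """
    low_concern, escalate = [], []
    for name, (ttc, exposure) in chemicals.items():
        if margin_of_safety(ttc, exposure) >= min_mos:
            low_concern.append(name)
        else:
            escalate.append(name)
    return low_concern, escalate


# Hypothetical chemicals and values, for illustration only.
chemicals = {
    "chem_A": (8.0, 0.5),   # MOS = 16
    "chem_B": (8.0, 16.0),  # MOS = 0.5
}
low_concern, escalate = triage_level1(chemicals, min_mos=1.0)
print(low_concern, escalate)  # ['chem_A'] ['chem_B']
```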

Level 2: High throughput in vitro screening
Level 2 comprises assays that can be easily run on a large number of compounds in high-throughput mode. Examples here are the Tox21 and ToxCast assays from NIH and EPA, respectively, which could be run on essentially any compound that is soluble in water or DMSO and is not highly volatile. Ideally, Level 2 assays should be tailored toward endpoints of regulatory concern, with a good understanding of how the assay fits within known MOAs or adverse outcome pathways (AOPs). The original intentions of Tox21 and ToxCast were to generate directly comparable data for a large number of chemicals to facilitate grouping of chemicals by MOA, ranking of chemicals within a particular MOA by potency, prioritization of these chemicals for risk assessment by regulatory bodies, and ultimately providing a platform where unknowns could be subjected to the same battery of assays and MOAs assigned based on the pattern of activity seen in the results. Even though many of these goals were later shown to be out of reach, at least for the assays chosen to be part of the Phase I and Phase II efforts, these assays still show promise for grouping chemicals as part of a process known as "biological read across." This approach is similar to hazard identification in the traditional risk assessment process.
None of the ToxCast in vitro assays, including the proliferation assay in the T47D breast cancer cell line, evaluate uterine response, even though the uterus is a critical target tissue for estrogenic compounds and there are differences in breast and uterine responses to various estrogenic compounds (Barakat, 1995). Our approach for creating a FFP assay for estrogenicity (Miller et al., 2016, 2017; Beames et al., in press) has relied on using a human adenocarcinoma cell line, i.e., Ishikawa cells, and confirming that the cells retained all components of the estrogen signaling network involved in the control of cell proliferation.
To test the utility of the in vitro model to predict quantitative dose-response relationships in the species of interest, we tested the assay output for endogenous estrogen and known human uterotrophic drugs against clinical and epidemiological data. The FFP in vitro uterine assay consistently predicted chemical concentrations associated with human estrogenicity. Further, the assay predicted activity at lower concentrations than any of the ToxCast HT assays (Miller et al., 2016).
Based on these studies, we had confidence that the assay was sufficiently sensitive to predict safe levels of human exposure for uterotrophic compounds. To test the application of the assay for the broader universe of environmental compounds, we ran dose-response curves for 116 chemicals (Beames et al., in press), including chemicals that were determined to be estrogenic (n = 106) or non-estrogenic (n = 10) in the EPA estrogen model (Browne et al., 2015;Judson et al., 2015), and possible metabolites of 5 parent compounds from the ToxCast library. The Ishikawa assay was compared to ToxCast assay results, as well as in vivo rodent uterotrophic and two-generation reproductive study data. Active concentrations in the uterine proliferation assay were consistently among the lowest of the test models, whether comparing in vivo or in vitro results, indicating that observed activity in the in vitro model would provide a sufficiently protective point of departure for risk assessment. However, when compared to animal studies, approximately 41% of the compounds that caused uterotrophic response in guideline-like rodent studies did not show proliferative activity in the human cell-based assay.
This disconnect between the human in vitro assay and the rodent in vivo assay highlights an important issue in validation of NAMs, i.e. selection of the benchmark or comparator. When comparing in vivo studies in rats with assays in human-relevant cells, challenges arise related to decisions about defining compounds as positive or negative. Should animal studies be considered the "gold standard" when we know that many substances cause an effect in rats but not in humans, or vice versa? Or should we not even attempt to "predict" rodent responses and benchmark NAM predictivity against human data only? This is an open question regarding validation, and one that will (at least for the time being) necessarily be addressed on a case-by-case basis. Here, the FFP estrogen model faithfully reproduced human response for the admittedly few compounds for which clinical data was available (n = 4), but only showed activity for just over half of the rodent uterotrophic compounds. While we neither have nor ever expect to have human data with most chemicals, it nonetheless bears emphasis that the responses in test animals in vivo frequently differ from those in human populations, and animal tox-

Level 3: Fit-for-purpose assays and safety assessment
Level 3 assays would be designed so that the output from the NAMs in Level 2 and from more refined exposure assessments increase confidence in the estimated MOE and support formal risk assessment without moving on to more complex assays. These applications require FFP assays. These FFP platforms are targeted cellular assays that are developed based on an understanding of human biology. The landscape of FFP assay platforms in Level 3 is diverse with many more under active development. Complex CNS tissues, so-called mini-brains, are just one example (Pamies et al., 2017; Boutin et al., 2018). Three-dimensional cultures of cells derived from various tissues are also being used to develop more relevant platforms and are increasingly integrated with HCI technologies in order to simultaneously assess multiple phenotypic endpoints; the combination of these platforms provides a form of cellular pathology (Kabadi et al., 2015).
One of the most publicized successes of the ToxCast program to date, the prediction of in vivo rodent uterotrophic results using in vitro assay data, used a computational model. This early success has energized efforts in the field of endocrine disruption to try to replicate this success for other, related MOAs, such as using androgen receptor data to predict in vivo Hershberger assay results. For estrogenic mode of action, over one hundred of the 1812 evaluated chemicals were predicted to be endocrine-active based on this computational model (Browne et al., 2015; Judson et al., 2015). These results appear robust as model results demonstrated that the method worked well for a set of reference chemicals by correctly identifying agonist, antagonist and inactive compounds with high sensitivity and specificity. Using HT-IVIVE and the results of the assays allowed estimation of pathway-altering doses. While this estrogenicity model utilized a variety of molecular and cellular endpoints, all were based on signaling through two estrogen nuclear receptors (ESR1, also referred to as ER66/ERα, and ESR2/ERβ). The type of analysis is akin to Level 2 in our scheme.
The analysis of estrogenicity using information on only the two receptors ERα and ERβ ignores other pertinent information on estrogen signaling. FFP assays for estrogenicity should be designed to account for human biology, focus on specific cellular outcomes, and assure that the cell system has the molecular pathway components necessary to recapitulate the cellular read-outs relevant to an adverse response in vivo. The coordination of estrogen responses in any tissue integrates the action of at least five estrogen receptors, including both classical nuclear and membrane-bound receptors: ER66, ER46, ER36, ERβ, and GPR30 (Miller et al., 2017). The goal in FFP assays for estrogenicity is to generate appropriate cellular response assays, quantitative IVIVE approaches, refined exposure information, and inferences about possible bioactive metabolites to create a package of information sufficient for MOA-based risk assessment without resorting to in vivo testing.
The ToxCast estrogenicity assays were conducted with cell-free and pathway-overexpression systems and using a phenotypic assay measuring proliferation in a breast cancer cell line [...]icity results should not necessarily be considered the gold standard for comparison (Blaauboer and Andersen, 2007).
Other examples of FFP assays have been pursued with p53-mediated DNA damage (Clewell et al., 2014; Adeleye et al., 2015; Clewell and Andersen, 2016), PPARα signaling (McMullen et al., 2014, 2019) and adipocyte differentiation (Foley et al., 2017; Hartman et al., 2018). Our work with these FFP assays has helped establish criteria for the cellular read-outs to ensure applicability of the results for assessing adversity. These assays can be particularly informative of human relevance. For example, with the PPARα assay, criteria have been established for comparing results from human cells to in vivo outcomes in rodents that appear to be qualitatively different from responses expected in humans (McMullen et al., in press). Another opportunity arising from development of FFP assays is the possibility of examining the signaling networks controlling various cellular responses to develop computational systems biology modeling tools to assess the biological basis of cellular dose-response behaviors, including a better understanding of threshold behaviors at the cellular and organism level (Zhang et al., 2014, 2015; Tyson and Novák, 2015; Clewell and Andersen, 2016).

Level 4: More intact systems
Ultimately, the goal of defining multiple levels for context-appropriate testing is that over time the tools used for targeted in vivo studies in test animals will be regarded as studies of last resort. Problems arising from high dose animal studies in relation to human relevance and kinetic non-linearities at high doses are well-documented and these high dose rodent studies frequently raise more issues than they resolve. Instead of simply moving to in vivo studies, over time Level 4 should become populated with more complex assays, including multi-cellular and multi-dimensional assays, human-on-a-chip (Zhang and Radisic, 2017; Zhang et al., 2018), linked tissue surrogates with provisions for liver metabolism and inter-tissue circulation of metabolites, and inclusion of metabolite generating cells or subcellular fractions within the assay platforms (Zhang and Radisic, 2017; Zhang et al., 2018), providing a variety of biologically inspired test systems for conducting more integrated toxicity testing (Marx et al., 2016).

Dosimetry, extrapolation and MOE considerations
While the preceding description of the risk context-related levels focused more on assessing the biological targets, MOAs and dose-response, each level also requires consideration of dosimetry, IVIVE and exposure assessment in order to estimate MOSs or MOEs to place results from NAMs in an appropriate risk/safety context (NRC, 2007).
At Level 1, computational methods permit HT predictions of exposure and metabolism including estimation of intrinsic clearance (CLint) and unbound fraction (Fu) based on chemical structure. These metabolism prediction tools are currently best suited to predict parent chemical clearance, and the domain of applicability is centered in the pharmaceutical compound space. However, recent efforts are testing their suitability for use as first tier metabolism predictions of environmental compounds (Casey et al., 2018; Moreau et al., in preparation). A proof of concept study was completed that integrated TTC values with HT exposure modeling to provide prioritization level MOEs for close to 7000 substances (Patlewicz et al., 2018). More recently, TTC values derived for approximately 40,000 substances (Nicolas et al., in preparation) have been disseminated publicly as a searchable table on the internet. Advances in HT exposure modeling have now yielded median human intake rates and credible upper bound intervals for more than 450,000 chemicals in various U.S. population demographic age groups (Ring et al., 2019). As one component of the priority setting in the US EPA's projected approach for conducting risk-based prioritization of existing chemicals under TSCA, the Agency intends to use HT exposure modeling and TTC values to calculate TTC-to-exposure ratios (US EPA, 2018c).
MOE calculations for Level 2 screening assay results with consideration of exposure began with simple kinetic models, assuming steady-state oral exposure and determination of HEDs (Wetmore et al., 2013). IVIVE methods have also been used with AC50 values from ToxCast estrogenic assays to generate HEDs and MOEs by comparing these HEDs to predicted human exposures. In this way, these ToxCast estrogen assay-derived MOEs could be used as stand-alone risk-based screening values or compared to the MOE of a ubiquitous dietary phytoestrogen to provide additional context (Becker et al., 2015).
We recently proposed an alternative dosimetry measure for fruit and vegetable mixtures (Wetmore et al., 2019). The dose measure was related to daily intake of the juices in relation to their bioactivity in the BioMap® assay platform. This measure of activity was then compared with the equivalent adjusted daily intake of agrichemical residues found in these produce materials in relation to their potency. While this measure of dose does not account for pharmacokinetics of the juices, which are complex mixtures, the adjusted daily intake allows comparison of the degree of assay activity expected from the produce and the agrichemicals. The contribution from most of the produce juices was more than 1000-fold greater than the contribution of bioactivity associated with agrichemicals used in growing this produce. This examination of fruit and vegetable juices falls into Level 2 testing with mixtures using a total intake dosimeter. More extensive examination of mixture kinetics could follow with identification of major components or fractionation into different chemical subclasses that could be studied individually. Depending on the test materials, especially for mixtures and chemical substances of unknown or variable composition (e.g., biological products, herbal medicines and dietary supplements, foods), dose measures other than HEDs will need to be considered and evaluated.
Whether dealing with mixtures with known constituents or single chemicals, available computational tools can predict likely metabolites (met-ID) and infer possible toxicity of test compounds (QSAR). These predicted values could, in theory, be used as an early estimate of bioactivity and MOE, though metabolite prediction software is presently more qualitative than quantitative. Level 2 would include measurements of CLint and Fu in HT assays to estimate steady-state concentrations expected from continuous daily exposures. The ratio of the HED and actual human exposure provides the MOE at Level 2. Currently, the HTS assay systems focus on clearance of the parent chemical, assuming that metabolism is an inactivating step for the chemical. This assumption provides a first-order estimate of risk based on parent chemical but leaves bioactivation via metabolism unaddressed. Efforts are currently underway to address this gap by incorporation of metabolism into HTS screens, through the addition of hepatocytes, cellular fractions (S9) or recombinant enzymes (DeGroot et al., 2018). Unfortunately, the broader testing community rarely accounts for activation via metabolism in HTS evaluation.
The role of metabolism in toxicity will more likely be addressed in Level 3, where metabolic competence can be incorporated into the FFP assay designs. FFP assays conducted in the absence of components that ensure production of metabolites allow for the assessment of the bioactivity of the test compound itself, although its activity in an organism would depend on metabolism and bioavailability. With FFP assays at Level 3 it becomes particularly important to account for metabolism, either by incorporating metabolically competent preparations into the FFP assay or by procuring potential metabolites and testing them (Beames et al., 2019).
For Level 3 studies to be regarded as sufficient for risk assessment, it may be necessary to estimate HEDs for more diverse exposure conditions and for multiple routes of exposure. By combining computational approaches for metabolism and pharmacokinetics (IVIVE, PBPK) with in vitro readouts for the suite of metabolites expected in the blood for a given exposure, it should be possible with more advanced kinetic models to develop a combined estimate of potency that is predictive of in vivo experience for oral, dermal and inhalation exposures and for multiple compounds. An example of assessment of parent compounds and active metabolites was completed in a case study looking at combined exposure to the multiple blood metabolites expected from exposures to both diethylhexyl phthalate and dibutyl phthalate. Here, in vitro assays evaluated potency of both parent phthalates and active metabolites, and PBPK modeling was used to predict serum metabolites at expected human exposures (Clewell et al., submitted).
Broad screening of possible MOAs along with Level 1 chemical characterization may indicate that responses are due to direct chemical reactivity or broad low-affinity non-covalent interactions (Judson et al., 2016) rather than interaction with more specific biological targets. In these cases, no observed transcriptional effect levels (NOTELs) coupled with HT-IVIVE can support decisions about MOEs. Many of the same considerations for Level 3 assays also apply for the more complex assays in Level 4. Of course, decisions based on in vivo studies would use pharmacokinetic (especially PBPK) modeling for assessing internal doses and for selecting extrapolation methods (e.g., threshold models augmented with use of uncertainty factors for non-cancer risk evaluations).
An integrated approach with bioactivity testing and exposure assessment for assays at Level 4 (Webster et al., 2019) employed an MOE approach referred to as a bioactivity exposure ratio. Results from HTS assays (ToxCast), in vivo screening level assays, and in vivo apical tests of adverse effects were used to inform the need for conducting additional testing. Importantly, this case example involved several data-rich substances and showed that in vitro MOE values were actually lower than the in vivo MOE values, an observation "that this health protective approach could facilitate a substance's prioritization or deprioritization for further action, including the need for comprehensive in vivo testing."

Domain of applicability
The goal of organizing NAMs within these four levels was to consider when data from any of the levels would return adequate information to determine product safety for intended uses. The organization then provides a focus on the risk context, not simply the types of assays and computational tools available. Its applicability to particular chemistries or industrial sectors depends on the end-uses of products and whether the value of the product is associated with some biological activity. With environmental compounds, where the functionality is not related to specific biological activity, these tools offer significant promise for safety assessments based on measures of MOEs (TTC or AC50 divided by exposure). This approach is more safety assessment-based and differs from risk assessment procedures over the last 40 years where there was an attempt to estimate a human dose (exposure) that would be expected to produce some low incidence of response in a human population. This difference, i.e. a safety assessment versus risk assessment emphasis, was highlighted as a key point in applying TT21C information rather than in vivo animal studies for decision-making (Andersen and Krewski, 2010).
The transformation from traditional risk assessment approaches to this problem-oriented, safety assessment approach based on the use of NAMs across the different levels should be appropriate for environmental compounds, GRAS substances, cosmetics and food additives (Rovida et al., 2015; Hartung, 2018). The use with functional food additives or cosmetics with targeted biological activity poses challenges depending on the nature of the biological activity, the level of exposure from the intended uses of products, and on the possibility of inappropriate use conditions leading to excessive exposures. These two classes, i.e. functional foods and bioactive cosmetics, are intermediate between environmental compounds and those marketed because of end-use bioactivity.
Pharmaceuticals and pesticides pose challenges in that intrinsic biological activity is essential to efficacy for their intended uses. These classes of compounds can have both excessive on-target and unanticipated off-target biological activity. The discussed scheme for using NAMs for safety assessment would likely need to be customized for pharmaceuticals and pesticides. The challenges of developing NAM-based approaches with bioactive compounds were highlighted recently in a multi-stakeholder meeting aiming to establish readiness criteria for assessing developmental neurotoxicity (Bal-Price et al., 2018). The approaches with these bioactive compounds need to be fashioned to capture multiple possible MOAs and encourage use of integrated approaches to testing and assessment (IATAs) (Tollefsen et al., 2014) that have undergone some level of mechanistic validation (Hartung et al., 2013). Nevertheless, the challenges in pursuing NAM-based safety assessment with pesticides and pharmaceuticals do not diminish the promise of their more rapid application with these classes of products.
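The MOE measure used in this section (TTC or AC50-derived dose divided by exposure) lends itself to a simple prioritization calculation. The sketch below is a hypothetical illustration only: the `Chemical` record, the function names and the example values are assumptions introduced here, and the source does not prescribe any particular MOE cutoff, so none is applied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chemical:
    """Hypothetical record pairing a point of departure with an exposure estimate."""
    name: str
    point_of_departure_mg_per_kg_day: float  # TTC value or AC50-derived HED
    exposure_mg_per_kg_day: float            # e.g. upper-bound predicted intake


def margin_of_exposure(chemical: Chemical) -> float:
    """MOE = point of departure divided by estimated human exposure."""
    if chemical.exposure_mg_per_kg_day <= 0.0:
        raise ValueError("exposure must be positive to compute an MOE")
    return chemical.point_of_departure_mg_per_kg_day / chemical.exposure_mg_per_kg_day


def rank_by_moe(chemicals: List[Chemical]) -> List[Chemical]:
    """Order chemicals from smallest to largest MOE (smallest margin first)."""
    return sorted(chemicals, key=margin_of_exposure)


if __name__ == "__main__":
    # Illustrative values only; not taken from any cited data set.
    candidates = [
        Chemical("substance_a", 1.5e-3, 1.5e-6),
        Chemical("substance_b", 1.0e-3, 1.0e-4),
    ]
    for chem in rank_by_moe(candidates):
        print(f"{chem.name}: MOE = {margin_of_exposure(chem):.3g}")
```

Sorting by ascending MOE places the substances with the narrowest margins first, which is one plausible way to order candidates for further evaluation at a higher level.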

Summary
With the explosion of available NAMs in the past decade and changes in the regulatory environment afforded by various initiatives such as the Frank R. Lautenberg Chemical Safety for the 21st Century legislation, it is an opportune moment to assess how information developed using NAMs will shape approaches for various risk assessment decisions. In looking over the possibilities for their use, there is no one-size-fits-all solution; rather, the context of the decision needs to drive the selection of NAMs used in any risk assessment. This contribution organizes NAMs into different levels, emphasizing the types of decisions that can follow from completion of studies at each of the levels. Importantly, most risk-based decisions do not require bringing compounds or classes of compounds through a tiered strategy (i.e., going lockstep from Level 1 through Level 4). Moving through just one or two of these levels should allow decisions about relative risks of products, including absence or low degree of potential anticipated toxicity and low expected exposure (i.e., very high MOSs or MOEs). Level 2 and 3 assays should provide the necessary information for assessing MOAs, AC50s or LECs and, when combined with improved human exposure assessment methodologies, should become preferred approaches for most safety assessments. The context-dependent applications of NAMs and the functional roadmap we describe may be useful in motivating additional case examples documenting the utility of, and confidence in, using a defined set of NAMs for specific decisions. In addition, the framework and roadmap can also help to identify where additional scientific research is needed to build greater confidence in various NAMs so that they can be used in the future with the necessary degree of confidence.