A Critique of the EC’s Expert (Draft) Reports on the Status of Alternatives for Cosmetics Testing to Meet the 2013 Deadline

The Cosmetics Directive (Directive 76/768/EEC), now recast as Regulation 1223/2009, bans testing on animals for cosmetic purposes as of March 2009. In addition, the “marketing” (i.e., import and sale) of products and ingredients tested on animals outside Europe is also prohibited after that date. A postponement of the marketing ban until 2013 was provided, however, for three endpoints: toxicokinetics, repeated dose, and reproductive toxicity. At the time, these tests were considered harder to replace. Under the terms of the Cosmetics Directive, the European Commission shall publish a proposal for an extension of this deadline if it decides that alternatives for these three tests will not be available by this date. As part of this process, the Commission gathered experts from across Europe in May 2010 to produce a report on the status of alternatives for cosmetics testing. The aim of the experts’ report was to evaluate whether alternatives would be available by 2013 and, if not, to establish recommendations and a timeline for complete replacement. In July 2010, a draft version of the report was made available for public comment, comprising five chapters covering five endpoints: repeated dose, skin sensitization, carcinogenicity, toxicokinetics, and reproductive toxicity (http://ec.europa.eu/consumers/sectors/cosmetics/documents/public_consultation/index_en.htm). The reports covered how information on the endpoint is currently derived (i.e., Scientific Committee on Consumer Safety (SCCS) requirements for in vivo tests) and then summarized the various alternative approaches, including in vitro methods, QSARs (Quantitative Structure Activity Relationships), and

Summary
The 7th Amendment to the EU’s Cosmetic Directive (now recast as Regulation 1223/2009) bans the testing of cosmetic ingredients and products on animals, effective 2009.
An extension until 2013 was granted, for marketing purposes only, for three endpoints: repeated dose, toxicokinetics, and reproductive toxicity. If the European Commission determines that alternatives for these endpoints are not likely to be available, it can propose a further extension. To this end, the Commission has instructed experts to produce reports on the status of alternatives for the 2013 deadline. We criticized the draft reports on a number of issues. First, the experts fell into the “high fidelity fallacy trap,” i.e., asserting that full replication of the in vivo response, as opposed to high predictivity, is required before a non-animal test can be considered useful for regulatory purposes. Second, the experts’ reports were incomplete, omitting various methods and failing to provide data on the validity, reliability, and applicability of all the methods discussed, regardless of whether the methods were in vivo, in vitro, or in silico. In this paper we summarize our criticisms and provide some of the missing data in an alternative proposal for replacement of animal tests by 2013. It is our belief that use of the Threshold of Toxicological Concern (TTC) will be a useful method to mitigate much animal testing. Alternative approaches for carcinogenicity and skin sensitization could be considered sufficient in the very near future, even though these tests are not listed under the 2013 extension. For repeated dose, toxicokinetics, and reproductive toxicity, a combination of in vitro methods may be able to provide appropriate protection for consumers, especially when viewed in the context of the poor predictivity of the animal models they replace. We hope the revised report will incorporate these comments, since a more thorough and positive review is required if the elimination of animal testing for cosmetics in Europe and beyond is to be achieved.

ures. We include comments from experts who assisted with our submission to the Commission. In addition, for the purposes of this paper, we offer for each endpoint a stepwise approach to testing, incorporating the TTC approach and validated methods that could be considered adequate. We do this to stimulate debate on the adequacy of alternative methods and to provide an example of a basis from which we would have liked the experts to start. At the time of writing, the experts' final report has yet to be published, so our comments remain relevant to the draft version only. We hope that some of the comments included here have already been taken on board in their revisions and will be seen in the final report. Since there has been considerable delay from the publication of the draft report (July 2010) to the appearance of the final report (still not available in March 2011), we think there is value in offering our position on the status of alternative methods in advance of its publication.

General comments on experts' approach
The Commission, in instructing the experts, gave them terms of reference (Anon, 2010). The principal aim of their reports was to evaluate the "state of play" of alternatives for the 2013 deadline, and "that the state of play must be as neutral as possible." From the outset, however, the assumption was that alternatives would not be available: "It should provide a wide and objective overview on the technical difficulties in complying with the ban in relation to tests evaluating repeated-dose toxicity, reproductive toxicity and toxicokinetics, in particular those for which there are no replacement methods or strategies yet under consideration. It should summarize the status and prospects of alternative methods and a scientifically sound estimate of the time necessary to achieve full replacement of animal testing for the above mentioned complex endpoints." What was not made clear, however, was that the evaluation of what could be considered a "replacement" should refer to methods that can be used for the purposes of the Cosmetics Directive, i.e., for regulatory purposes. It appears that the approach taken throughout the reports was broader than this, i.e., to examine what is required before the entire mechanism of action of a particular toxic reaction can be understood and modeled. This falls into the high-fidelity fallacy trap recognized by Russell and Burch (1959), also known as the "uncertainty paradox" (Schaafsma et al., 2009): the assumption that in vivo animal models are automatically superior models of the human response and that, to be useful, non-animal methods must replicate the in vivo animal model in full.
This approach fails to recognize that not all aspects of the mechanism of action need to be covered by a model for it to be highly predictive (and therefore useful for regulatory purposes). If any given replacement method is required to replicate the in vivo response in full, the timescales for its development will inevitably be inordinately long. The comparison is also unfair: animal models, while providing a picture of the "whole body response" (which in vitro methods cannot), are nonetheless the wrong body and therefore carry unavoidable errors due to species differences. As a result, ani-

other in silico methods. A table was consistently used to list the methods by mechanism of action, area of application, and "status" (i.e., in research and development, optimization, prevalidation, validation or regulatory acceptance).
We submitted extensive comments on the draft chapters as the ECEAE (the European Coalition to End Animal Experiments). The ECEAE was one of the leading pan-European organizations that campaigned for an end to cosmetics testing on animals in Europe under the Cosmetics Directive. We therefore have an interest in ensuring the bans are upheld, for ethical, if not for scientific reasons. Regardless of the ethical dimension to this debate, it was our opinion that the experts' reports were significantly scientifically flawed in a number of ways. This is a concern to us since we wish the debate to be founded on the best possible scientific evidence and assessment. We have had a recent poor experience with a similar expert report that is currently being investigated by the EU Ombudsman for lack of balance (Bailey and Taylor, 2009), and we would like that to be avoided in this case.
First, we contend that the experts applied the wrong legal test, asking whether it was possible to replicate the animal model in full, and not whether alternatives could predict human effects reliably. Given this "ultimate challenge" approach, it is not surprising that the experts believed the 2013 deadline would not be met for any of the five tests and could not, in most cases, offer a rational timescale for when they would be.
Irrespective of differences of opinion regarding the correct approach, we felt the reports, on the whole, lacked a proper evaluation of the status of the alternatives. Without this, we feel it is impossible to assess where the weaknesses are and what additional research is needed. A proper evaluation, in our opinion, includes assessment of the reliability of the method, its accuracy (including concordance with in vivo or gold standard methods, sensitivity and specificity), applicability domain (based on the known mechanism of action, range of substances used in any evaluation or known physical or biological limitations of the test) and, finally, the availability of the method. The point at which methods would, in all likelihood, be considered adequate by the regulators (i.e. the SCCS), if they applied the correct legal test, should have been given and all methods (whether in vivo or in vitro) rated against this.
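The accuracy measures named above (sensitivity, specificity, and concordance) can be made concrete with a minimal sketch. It assumes each test substance has been classified as positive or negative both by the method under evaluation and by the reference (in vivo or human) data. The class name `ValidationCounts` and the example counts are hypothetical and are not taken from any study cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCounts:
    """Counts from comparing a test method with a reference classification."""
    true_pos: int   # reference positive, method positive
    false_neg: int  # reference positive, method negative
    true_neg: int   # reference negative, method negative
    false_pos: int  # reference negative, method positive


def _ratio(numerator: int, denominator: int) -> float:
    # Guard against an empty category, where the measure is undefined.
    if denominator == 0:
        raise ValueError("measure undefined: no substances in this category")
    return numerator / denominator


def sensitivity(c: ValidationCounts) -> float:
    """Fraction of reference positives that the method detects."""
    return _ratio(c.true_pos, c.true_pos + c.false_neg)


def specificity(c: ValidationCounts) -> float:
    """Fraction of reference negatives that the method clears."""
    return _ratio(c.true_neg, c.true_neg + c.false_pos)


def concordance(c: ValidationCounts) -> float:
    """Fraction of all substances on which method and reference agree."""
    total = c.true_pos + c.false_neg + c.true_neg + c.false_pos
    return _ratio(c.true_pos + c.true_neg, total)


# Hypothetical counts for 100 substances, for illustration only.
example = ValidationCounts(true_pos=40, false_neg=10, true_neg=35, false_pos=15)
print(sensitivity(example), specificity(example), concordance(example))
```

Applying the same three functions to an animal test and to a non-animal method, each scored against human data, is the kind of like-for-like rating this section asks for.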
The reports also were inconsistent in their approach, with some chapters (e.g., the one on toxicokinetics) adopting a more proactive, forward-thinking strategy. Some included concepts such as the Threshold of Toxicological Concern (TTC), an approach that can mitigate some tests, while others did not.
Finally, the inclusion of two additional endpoints (skin sensitization and carcinogenicity) in the reports was done under instruction from the Commission, with the implication that they too fall under the 2013 deadline, which is not consistent with the Cosmetics Directive and is of grave concern to us.
What follows is a summary of our comments on the draft reports, providing examples of where each chapter could have been improved. We looked at a number of key measures of quality in the chapters, including neutrality, completeness, use of quantitative measures, relevance of information to the cosmetics sector, as well as consistency across chapters for these meas-

as discrete endpoints, distinct from repeated-dose toxicity in EU legislation. Examples abound: in REACH (Regulation No 1907 (SCCP, 2006) and ECVAM reports on the status of alternative methods (Zuang et al., 2010) list these endpoints distinctly.
There is no written evidence from the time of the negotiations of the Cosmetics Directive testing and marketing bans that would suggest that the European Parliament intended that "repeated-dose" be used to cover several animal tests in the way the Commission is implying. Subsequent assumption on the part of the Commission does not alter the legal text. We have written to the Commission about this, and we intend to challenge any proposal to extend the deadline that also applies to these endpoints.

Repeated dose (chapter 1)
Criticisms of the experts' draft report
This chapter discussed the in vitro models for common targets of organ toxicity such as hepatotoxicity, nephrotoxicity, cardiovascular toxicity, neurotoxicity, and pulmonary toxicity, but it did not evaluate the evidence for the validity of these methods in isolation or combination. The experts concluded that the 2013 deadline could not be met by these methods, as there is a need to "reproduce integrated, whole-organism responses," and thereby fell into the high-fidelity fallacy trap. Although there was a section discussing the limitations of in vivo models, this was not quantitative, and no reference was made to studies looking at the predictivity (validity) of rodent models of sub-chronic effects. For example, the review of Olson et al. (2000), which found that rats and mice only predicted 43% of human effects for 150 pharmaceuticals, was not included, nor was the paper from Spanhaak et al. (2008), where concordance of hepatotoxic effects between rodents and humans was only 60% for 1,061 pharmaceutical compounds and 46% for another set of 137 compounds. Finally, it must be remembered that, although widely accepted, the procedure to derive Margin of Safety values from No Observed Adverse Effect Levels (NOAEL) in test animals is not validated for the purposes of predicting human health risks (Blaauboer and Andersen, 2007). The chapter did not always give a neutral, complete, and quantitative evaluation of non-animal testing methods and their applicability to repeated dose endpoints. Recent developments were not reported, for example, standardized organotypic lung models for sub-chronic or chronic toxicity testing (MucilAir™ and EpiAirway™) and new culture techniques that allow the maintenance of physiological functions over several weeks in co-cultured human hepatocytes (Schmelzer et al., 2009; Zeilinger et al., 2010), renal epithelial cells (Jennings, 2010), and primary cardiomyocytes (Sreeijt, 2008).
In addition, industry strategies to include metabolism in in vitro models, as for

mal models themselves are not accurate predictors of the human response. For example, extrapolation factors have to be added to the results of animal tests to account for inter-species differences. And yet the experts did not evaluate the evidence for the validity of the alternatives in relation to the validity of the animal models. This is not a correct or fair legal test. The true test is whether alternative methods are sufficiently well developed and predictive of human responses to assure safety of human health to the same extent as animal models, or better. Alternative methods go through a lengthy validation process that includes an assessment of their predictivity. The results often are compared against the results of animal tests on the same chemicals, but attempts often are made to compare them against effects known in humans as well. For example, the reconstituted human skin epithelial models have been compared against human skin reactions and found to be more predictive than the rabbit test they now replace (Jirova et al., 2010). Assessing predictivity is difficult, however, because data on the gold standard (i.e., the human) are often limited, particularly for chemicals, since these, in general, should not be deliberately tested on humans. There are, however, reviews of the predictivity of animal models for human effects where these are known from history of use. We present this information for each endpoint so that it may be compared with similar data for the alternative methods. We do this on the premise that it is not possible to evaluate what the shortfalls of alternative methods are (and there are shortfalls) without first asking what is or would be adequate.
3 The addition of skin sensitization and carcinogenicity to the 2013 endpoints
The experts were instructed by the Commission also to consider the endpoints of skin sensitization and carcinogenicity. While there is no issue with asking for an update on the progress in alternatives for these endpoints, there also is no legal basis for any legislative proposal to extend the deadlines with respect to these two endpoints. The possibility of extending the 2013 deadline mentioned in the text of the Cosmetic Directive only applies to repeated dose, toxicokinetics, and reproductive toxicity; neither skin sensitization nor carcinogenicity is listed in Article 18(2) of the recast Cosmetics Directive 1223/2009. The Commission, in its 2004 report (SEC, 2004), which postdates the 7th Amendment agreed in 2003, made an assumption that the term "repeated dose" includes these endpoints, and it has continued to do so in subsequent reports. The Commission's argument seems to be that these tests can also be considered repeated-dose toxicity because animals may be subjected to more than one dose of the substance in question. This does not explain, however, why reproductive toxicity is listed separately, as tests for this endpoint also involve repeated dosing. It is our opinion that this position is untenable, since all these endpoints are terms of art, with a clearly recognized meaning in legislation, international guidelines, and toxicology industry usage. Carcinogenicity and skin sensitization always are listed

The experts' report covered this approach but did not come to a conclusion about its usefulness or its range of applicability beyond the fact that it could "contribute to intelligent testing strategies to help reduce and refine animal use."
In addition, other non-testing approaches for the risk assessment of long-term exposure to cosmetics were not adequately represented in the draft report, such as read-across or margin of safety values by grouping of chemicals, or weight-of-evidence considerations that take into account experience with previous consumer use (Weed, 2005).
Finally, the experts' report also suffered from "the common misconception that reliable QSAR models can be derived only for biological events with a common mode of action. It is important to remember that these methods do not model toxicological mechanisms but try to identify the relationship between compound properties and toxicological effects. With modern data mining and machine learning methods, reliable prediction models can be obtained from non-congeneric compounds, even for complex endpoints where many mechanisms may still be unknown. Models with improved predictivity and a broader applicability domain could be generated if software engineers would be granted access to the existing high quality test data which are not included in public databases" (Christoph Helma, personal communication).

An alternative analysis
Repeated dose information (NOAEL) is required for new cosmetic ingredients, but in many cases this can be avoided by use of the TTC concept, since substances are used in such low quantities that no adverse effects would be expected. Where this cannot be achieved, a battery of in vitro tests should be employed, focusing on the liver, which is the key target organ for repeated dose toxicity, followed by kidney, heart, nerves, lung, and immune system, and selecting the most sensitive endpoint for the determination of the NOAEL (Prieto et al., 2006). Several in vitro models, developed as stand-alone methods, are at various development/validation stages in relation to the most common targets for toxicity (Tab. 1). Although studies have shown these tests can predict effects seen in humans, the practical (but not insurmountable) problem remains: how to combine the results from several tests into a single "safety factor" for risk assessment purposes. Suggested approaches, such as in Prieto et al. (2006), could be used as a basis.
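The two-step logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assess_repeated_dose`, the `Assessment` record, and the per-organ values are hypothetical, and the conversion of in vitro results into dose-equivalent values is exactly the unresolved practical problem noted in the text, so the numbers passed in are placeholders. Exposure and threshold must be given in the same unit.

```python
from typing import Mapping, NamedTuple, Optional


class Assessment(NamedTuple):
    decision: str
    critical_endpoint: Optional[str]
    point_of_departure: Optional[float]


def assess_repeated_dose(
    exposure_per_day: float,
    ttc_per_day: float,
    battery_results: Optional[Mapping[str, float]] = None,
) -> Assessment:
    """Step 1: TTC screen. Step 2: most sensitive endpoint from a battery.

    battery_results maps a target organ to a hypothetical dose-equivalent
    value derived from its in vitro test (lower means more sensitive).
    """
    if exposure_per_day < 0 or ttc_per_day <= 0:
        raise ValueError("exposure must be >= 0 and the TTC must be > 0")

    # Step 1: exposure below the threshold means no testing is needed.
    if exposure_per_day < ttc_per_day:
        return Assessment("no testing needed (below TTC)", None, None)

    # Step 2: exposure exceeds the TTC, so battery data are required.
    if not battery_results:
        return Assessment("in vitro battery required", None, None)

    # Select the most sensitive endpoint, i.e. the lowest value.
    endpoint = min(battery_results, key=battery_results.get)
    return Assessment("use most sensitive endpoint", endpoint,
                      battery_results[endpoint])


# Placeholder values for illustration only.
print(assess_repeated_dose(50.0, 90.0))
print(assess_repeated_dose(500.0, 90.0,
                           {"liver": 12.0, "kidney": 30.0, "heart": 45.0}))
```

Taking the minimum across organs is a deliberately conservative choice; a weight-of-evidence scheme such as that suggested by Prieto et al. (2006) could replace that single line.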

Skin sensitization (chapter 2)
Criticisms of the experts' draft report
Overall, we disagree with the authors' conclusion that alternatives for risk assessment decision-making for skin sensitization are not yet available. This is because several in vitro methods show extremely high concordance with in vivo data, in the realm of 80% accuracy (e.g., 89% concordance of the Direct Peptide Reactivity Assay with in vivo data on 82 chemicals), which is considered sufficient for ECVAM validation purposes (ECVAM, 2009). In addition, in contrast to the guinea

example, advanced new in vitro models to assess dermal penetration, including those of nanoparticles, and dermal metabolism (Jäckh, 2010; Landsiedel, 2010) could have been better detailed. Surprisingly, the report provided no information on the outcomes of Framework 7 project Predict-IV (on the optimization, standardization and characterization of the long-term human-based cell culture models utilized for assessing hepatotoxicity, nephrotoxicity and CNS toxicity) and Framework 6 project Predictomics (focused on the identification of biomarkers of chronic toxicity based on combined genomic, proteomic, and cytomic analysis of cells exposed to model hepatotoxins and nephrotoxins).
The use of Integrated Testing Strategies (ITS) to integrate in vitro models of various organ toxicities and in silico techniques was mentioned in the experts' report, but no specific strategies were discussed despite the chapter's conclusion that this is the way this endpoint may be replaced. No specific reference to the FRAME ITS (Grindon et al., 2008) was made in this context, and other strategies also were absent (Combes et al., 2006; Prieto et al., 2006; Boekelheide and Campion, 2010). As an aid to deciding which organs need to be targeted by in vitro models, more information on the percentage of adverse effects seen across the organs could have been provided from data on human exposure to chemicals. The Boekelheide and Campion (2010) paper suggests a systematic approach to the analysis of results from batteries of in vitro tests, by analogy with a system of aircraft accident investigation. This new Toxicological Factors Analysis and Classification System can discriminate on a mechanistic level between different types of failures that are initiated by a toxicant. A manifest "active failure" as a last step is conditional on previous "latent failures." The system will allow the development of a fully fleshed out Taxonomy of Adverse Effects.
One admirable approach the authors of this chapter undertook was to ask companies what their strategies for avoiding animal testing were in relation to repeated dose. It was disappointing that only Unilever and Nestle responded to their request. These companies were employing the TTC approach in order to establish whether testing is genuinely necessary. The TTC approach is based on the concept that for all substances there is a level of exposure below which there is hardly any risk to human health, regardless of the toxicity of the substance. The level of exposure depends on very broad classes of likely toxicity; those chemicals not at all likely to be toxic can have higher exposure levels. With respect to cosmetics, ingredients such as preservatives, fragrances, and dyes are present in only tiny amounts within a product, and so it is possible that for many ingredients exposure will never exceed the TTC. Rather than new animal tests, then, all that is required is an evaluation, based on chemical structural similarity to other substances, as to the likely risk, which then allows a calculation of maximum daily exposure. This concept was used first for food additives, but COLIPA research has shown it to be relevant for dermal (Kroes et al., 2007) and inhalation (Westmoreland et al., 2010) exposure to cosmetics, and examples are now available; the SCCS is reviewing the concept's usefulness at the moment.
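To illustrate why ingredients present in tiny amounts may never exceed the TTC, the calculation of daily exposure can be sketched as follows. This is a deliberately simplified exposure model and an assumption of this sketch, not a method taken from the experts' report: exposure is the amount of product applied per day multiplied by the ingredient's fraction in the product and by the fraction absorbed. The function names are hypothetical, and the TTC is assumed to be expressed in µg per person per day, the unit used by Munro et al. (1996).

```python
def daily_ingredient_exposure_ug(
    product_g_per_day: float,
    ingredient_fraction: float,
    absorbed_fraction: float = 1.0,
) -> float:
    """Estimated exposure to one ingredient in micrograms per day.

    Simplified model (an assumption of this sketch): applied amount
    x fraction of ingredient in product x fraction absorbed.
    """
    if product_g_per_day < 0:
        raise ValueError("product amount must be >= 0")
    for name, value in (("ingredient_fraction", ingredient_fraction),
                        ("absorbed_fraction", absorbed_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    grams_to_micrograms = 1e6
    return (product_g_per_day * grams_to_micrograms
            * ingredient_fraction * absorbed_fraction)


def max_ingredient_fraction(
    ttc_ug_per_day: float,
    product_g_per_day: float,
    absorbed_fraction: float = 1.0,
) -> float:
    """Largest ingredient fraction that keeps exposure at or below the TTC."""
    if ttc_ug_per_day <= 0 or product_g_per_day <= 0 or absorbed_fraction <= 0:
        raise ValueError("all inputs must be > 0")
    limit = ttc_ug_per_day / (product_g_per_day * 1e6 * absorbed_fraction)
    return min(limit, 1.0)  # a fraction cannot exceed the whole product


# Hypothetical product: 2 g applied per day, ingredient at 0.001% (1e-5).
print(daily_ingredient_exposure_ug(2.0, 1e-5))
# Hypothetical threshold of 90 µg/day with complete absorption assumed.
print(max_ingredient_fraction(90.0, 2.0))
```

Assuming complete absorption by default errs on the side of overestimating exposure; a measured dermal absorption value would lower the estimate.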

Alternative | Evidence of validity | Status
Step 1 - Low exposure substance (no proteins, heavy metals, polyhalogenated-dibenzodioxins): no testing needed if human exposure is below 1800 µg/day for Cramer class I (low toxicity expected), 540 µg/day for Cramer class II (medium toxicity expected), or 90 µg/day for Cramer class III (higher toxicity expected).
TTC | Relevance for cosmetics shown by COLIPA research (Kroes et al., 2007). Database on repeated dose oral toxicity data from 613 substances (Munro et al., 1996). | SCCS on-going evaluation for cosmetics.
Step 2 - Higher exposure substance: if human exposure exceeds TTC levels, conduct the in vitro testing below in combination with QSARs for specific endpoints in a weight-of-evidence approach and select the most sensitive endpoint for determination of the NOAEL (Prieto et al., 2006).
In vitro hepatotoxicity on human liver cell lines | Pfizer study found 80% of 243 human hepatotoxicants were detected (O'Brien et al., 2006). 100% of 10 hepatotoxicants detected (Horii and Yamada, 2007). | Requires validation studies. Long term cell lines and cultures now available.
In vitro kidney cell lines | Good prediction of 15 nephrotoxicants in vitro (Duff et al., 2002). | ECVAM recommended validation studies.
Cardiotoxicity (heart)
In vitro heart cells | 81% agreement between in vitro and clinical cardiotoxicity on 6 compounds (Schwengberg et al., 2004). Up to 97% agreement with in vivo test for 4 cardiotoxicants (Inoue et al., 2007). | Requires validation studies.
In vitro neuronal cell test | Excellent agreement with in vivo test for organophosphorus compounds (Malygin et al., 2003). | According to experts' report, ring trial ongoing with EU and US labs.
CFU-GM (from bone marrow cells) | Accurate prediction of in vivo results with 5 out of 6 test substances in pre-validation study. Positive results obtained on additional 20 substances (Pessina et al., 2001). | Validated by ECVAM in 2000 (ESAC statement, 2006).
In vitro human whole blood cytokine assay | Results correlated well with the in vivo data on 31 compounds (Langezaal et al., 2002). | ECVAM pre-validation in 2002.
Computer models for specific toxicity
Computer models: TOPKAT, DEREK | TOPKAT (based on 393 chemicals from US EPA, FDA): able to predict 30% LOAELs within a factor of 3; 60% within a factor of 10; 96% within a factor | Accepted for regulatory purposes for cosmetics, biocides, plant protection products and chemicals (REACH).
detailed discussion of such approaches is crucial. Finally, no reference was made to statistically-based models developed within the Framework 6 project CAESAR for skin sensitization and available for online use via the web (http://www.caesar-project.eu). These models have been developed and tested under stringent quality criteria to fulfill the principles laid down by the OECD, and the final models offer a robust and reliable method of assessing skin sensitization for regulatory use (Chaudhry et al., 2010).
In the section devoted to the animal test methods, such as the Local Lymph Node Assay (LLNA), GPMT or Buehler test, no evidence for their reliability, predictivity, or applicability was given. For example, it was not stated that, while the LLNA has been formally validated for hazard identification for regulatory purposes by the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM, 1999) by comparing it with GPMT and available human data, such an evaluation has not been performed for the guinea pig tests. The ICCVAM review found that the LLNA or GPMT only predicted human reactions 72% of the time (SCCNFP, 2000). In addition, no details of the outcomes of key studies on non-animal methods are given, including predictivity and applicability domains, or on the ability of the in vitro assays to estimate potency. Therefore, no fair comparison with the current in vivo tests can be made. No reference is made to the TTC concept or, more specifically, to the Threshold of Sensitization Concern (TSC) concept recently proposed by Keller et al. (2009). The TSC values (0.91 or 0.30 μg/cm² dependent on chemical class), derived in terms of amount per skin area and based on human skin sensitization data on 53 fragrance ingredients from the Research Institute for Fragrance Material of the International Fragrance Association dataset, largely support the dermal sensitization thresholds. Finally, there was no update from the Framework project Sensit-iv and no section devoted to ITS for skin sensitization developed under the Framework project OSIRIS.

Our proposed approach
Our suggestion is that the TSC approach may be used to mitigate any testing if the ingredient is used in very small quantities. If this approach cannot be used, then a combination of QSAR and in vitro peptide reactivity tests may be sufficiently predictive. The mechanism of how skin reacts to "sensitizing" substances is actually well understood and "haptenation", the reaction of proteins in the skin to the substance, is considered the key step. It is therefore possible to determine the skin sensitization potency of a substance based on how it binds to proteins. The Direct Peptide Reactivity Assay (DPRA), used by industry since the early 2000s, has almost completed ECVAM pre-validation. Evidence already indicates that this test alone can correctly predict 89% of substances and that further development of a model to consider metabolism would only underestimate the risk to humans. Computer models alone have similar predictive power. In addition, two in vitro methods using skin cells (MUSST and h-CLAT) are being pre-validated by ECVAM with results due in 2011. How these can be used in a strategy is illustrated in Table 2.

pig maximization test (GPMT) or Buehler test, which only allows a crude estimate of potency (Keller et al., 2009), some in vitro assays provide information on sensitization potency (e.g., DPRA, hCLAT assays). Since several methods already have entered the prevalidation stage (e.g., DPRA, hCLAT and MUSST entered prevalidation in 2009), we disagreed with the timelines for replacement of this endpoint: "up to 2019" is an overly cautious timeframe. Under REACH Annex XI, methods can be used for positive prediction if they are suitable for entry into ECVAM pre-validation and for both negative and positive prediction if validated to internationally agreed protocols.
It could be argued that the skin sensitization methods mentioned above would satisfy this already, and therefore we ask the question: if they are arguably suitable for predicting worker safety, why are they not (yet) suitable for predicting consumer safety of cosmetics?
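The stepwise approach proposed for skin sensitization (a Threshold of Sensitization Concern screen first, then QSAR combined with peptide reactivity) can be sketched as below. The rule for combining the QSAR and DPRA calls is an assumption made for illustration: either positive call is treated as a positive. The names `Outcome` and `assess_skin_sensitization` are hypothetical, and the threshold is passed in by the caller because the text gives two values (0.91 or 0.30 μg/cm²) without stating which chemical class each applies to.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    NO_TESTING_NEEDED = "exposure below threshold of sensitization concern"
    MORE_DATA_NEEDED = "QSAR and peptide reactivity results both required"
    TREAT_AS_SENSITIZER = "at least one method positive"
    TREAT_AS_NON_SENSITIZER = "both methods negative"


def assess_skin_sensitization(
    exposure_ug_per_cm2: float,
    tsc_ug_per_cm2: float,
    qsar_positive: Optional[bool] = None,
    dpra_positive: Optional[bool] = None,
) -> Outcome:
    """Step 1: threshold screen. Step 2: combine QSAR and DPRA calls."""
    if exposure_ug_per_cm2 < 0 or tsc_ug_per_cm2 <= 0:
        raise ValueError("exposure must be >= 0 and the threshold must be > 0")

    # Step 1: low dermal exposure per skin area needs no testing.
    if exposure_ug_per_cm2 < tsc_ug_per_cm2:
        return Outcome.NO_TESTING_NEEDED

    # Step 2: both non-animal results are needed for a decision.
    if qsar_positive is None or dpra_positive is None:
        return Outcome.MORE_DATA_NEEDED

    # Assumed conservative rule: either positive call counts as positive.
    if qsar_positive or dpra_positive:
        return Outcome.TREAT_AS_SENSITIZER
    return Outcome.TREAT_AS_NON_SENSITIZER


# Hypothetical exposures, checked against the lower of the two thresholds.
print(assess_skin_sensitization(0.10, 0.30))
print(assess_skin_sensitization(2.0, 0.30, qsar_positive=False,
                                dpra_positive=True))
```

The either-positive rule favours sensitivity over specificity; other weightings of the two results are possible.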
Again, the experts' report fell into the high-fidelity fallacy trap by insisting that the complete mechanism of action of skin sensitization needed to be modeled for complete replacement. Not all experts agree with this view. For example, Roberts and Patlewicz (2010) argue that haptenation (the reaction with protein) is the "single most important and possibly the only important step" in the prediction of skin sensitization. The extent to which a chemical will cause haptenation can be predicted by assessing its ability to react with proteins in vitro. Indeed, "whether a chemical is a sensitizer or not, and how potent it is if it is a sensitizer, depends on its chemical properties and on nothing else" (Dr. Dave Roberts, personal communication). The peptide reactivity tests have been criticized for not taking absorption and metabolism into account, but these experts contend that to do so would only underestimate the risk to the population at large (Roberts and Patlewicz, 2010).
In the in silico tools section, no reference was made to the work of Roberts and Aptula (2008) on the use of mechanistic domains within which simple and interpretable descriptors (logP and rate data) can be used to model the formation of the hapten and, in turn, skin sensitization. A mechanistically based paper that makes use of an in silico descriptor that is useful in modeling reactivity (and thus the LLNA) within the Michael domain is given under Enoch et al. (2008a). The same descriptor also has been used to model respiratory sensitization based on the same premise that haptenation is the key step that needs to be understood (the rest of the biology does not affect the sensitization outcome) (Enoch et al., 2010). The report also lacks a detailed discussion of a number of expert systems, such as the QSAR model Toxtree, which can be used to predict potential skin sensitization mechanisms based on the Enoch encoding (Enoch et al., 2008b) or the Roberts rules for reaction mechanistic domains (Aptula and Roberts, 2006), and Derek for Windows, which has an extensive rule base able to identify skin sensitizers. The rule base within Derek for Windows is mechanistically based, taking the premise that haptenation is the key event that leads to skin sensitization. The use of multiple in silico tools can lead to weight of evidence approaches for the prediction of skin sensitization; a

ers is even lower. The experts rightly point out that the current strategy for carcinogenicity is to assess for likely genotoxicity using in vitro methods. Ingredients that are positive in these assays are not taken forward for development. Added confidence in a substance's lack of carcinogenic potential is then provided by conducting a repeated dose test. Thus, the impact of a ban on carcinogenicity tests per se is minimal. It was therefore both disappointing and quite surprising to see the experts proceed to warn against abolishing in vivo carcinogenicity tests.
What should have been attempted, in our opinion, is an assessment of whether in vitro tests and other approaches, such as TTC, could provide not only adequate safety factors but also decrease the impact on the development of new ingredients.
According to Kirkland et al. (2005), 93% of 553 rodent carcinogens were detected in at least one of the three most common in vitro genotoxicity tests (Ames-test, Mouse lymphoma Assay, and in vitro Micronucleus test or Chromosomal Aberrations test). Combinations of two and three test systems had greater sensitivity than individual tests, resulting in sensitivi-

Criticisms of the experts' draft report
While this chapter provided an honest assessment of the current requirements for carcinogenicity testing based on the SCCS Information Requirements (SCCP, 2006), it also fell into the high fidelity fallacy trap and did not provide an adequate assessment of the validity and reliability of either in vivo or in vitro methods.
As stated by the experts, the two-year cancer bioassay is rarely conducted as it is costly, lengthy, and has animal welfare implications. The SCCS does not require carcinogenicity tests unless "considerable oral intake or dermal absorption is expected." This is confirmed by Pauwels et al. (2009), who showed that carcinogenicity data were seen in less than 40% of submissions to the SCCS between 2000 and 2006. Given that these submissions are for cosmetic ingredients of particular concern (dyes, preservatives, and UV filters) one might expect that the prevalence of carcinogenicity data among other cosmetic ingredient dossi-

Alternative Evidence of validity Status
Step 1 - Low exposure substance, no very strong sensitizers: no testing needed if human exposure is below 0.91 or 0.30 μg/cm² of skin area, depending on chemical class (Keller et al., 2009).

Threshold of Sensitisation
Applicability to skin sensitization evidenced with a SCCS on-going evaluation for cosmetics.

hCLAT (human Cell Line Activation Test)
Evidence of validity: Evaluated by 5 labs (P&G, Shiseido, Kao, Henkel and L'Oreal) since 2004 (Anon, 2008). Studies at Shiseido show 93% correct predictions in 29 chemicals (Sakaguchi et al., 2009) and 84% agreement in 100 chemicals.
Status: JaCVAM (lead) - ECVAM pre-validation study on-going (results expected 2011).

tion assays (CTA) seem to offer the most promising replacement options. The experts did not provide details on the predictivity and reliability of these tests, however, and therefore appeared too hasty in their dismissal of the opportunities the tests may provide for complete replacement. Data on rodent and human predictivity of the Syrian Hamster Embryo (SHE) assay published by Long (2007) showed that SHE has a concordance with the rodent bioassay ranging from 85% (SHE pH ≥7) to 74% (SHE pH 6.7), sensitivity 92% (SHE pH ≥7), specificity 85% (SHE pH 6.7) and predictivity 88%. A meta-analysis done by the OECD indicated that the three CTA assays have an overall sensitivity of 90% for class I (known) and 95% for class II (possible/probable) human carcinogens (OECD, 2007). In comparison to this, the rodent bioassay was calculated to have a sensitivity of 50% or 90% on human carcinogens, depending on how the results are interpreted (Ennever and Lave, 2003). The SHE (both pH ≥7 and pH 6.7) correctly identified 100% of the 44 inorganic human carcinogens tested and was able to identify 9 out of 11 organic carcinogens, a sensitivity of 82% (OECD, 2007). "The limitations to these tests seem minimal, provided that cell clones that retain enough metabolizing capacities to detect different classes of chemicals acting as genotoxic compounds through the formation of stable adducts to DNA are used and more than one in vitro test is performed to improve reliability and predictability, the complete replacement looks like a real possibility" (Annamaria Colacci, personal communication).
In the QSAR section, we suggested additional work by Fjodorova et al. (2010) under the EU Framework CAESAR project and two QSAR models for carcinogenicity developed by Contrera et al. (2007). The section on TTC was well developed, although it was almost forgotten in the Conclusions, which stated that when repeated dose toxicity is banned, methods for quantitative detection of non-genotoxic carcinogens will be limited to tools such as read across, QSAR, and TTC. "It should be better explained that the use of TTC and read across are not a limitation for safety but for the development of new cosmetic ingredients. The NOAEL and the application of the safety (better to say uncertainty) factor is not always considered the best approach to protect human health, and it must be remembered that the "safety" factor is an arbitrary number that is applied to take into account interspecies and intraspecies differences when extrapolating from animal studies, thus it is not the panacea" (Annamaria Colacci, personal communication).

Our proposed approach
We propose that genotoxic carcinogens can be identified by a number of long-standing in vitro cell based tests. These tests allegedly have been over-sensitive, but newer tests are more predictive. A more complete method based on CTA also has been in use for more than 40 years but only recently entered an ECVAM pre-validation study. Experts agree that carcinogenicity studies are rarely conducted, as they are expensive and time-consuming, are not specified under the Cosmetic Directive, and are rarely requested by the SCCS. A combination of the accepted in vitro genotoxicity tests, the CTA assay, and exposure-based TTC approaches (providing a precautionary approach for consumers) should be the preferred approach (see Table 3).

ties of around 90% or more, depending on test combination. The specificity of the Ames test was reasonable (73.9%), but all mammalian cell tests had very low specificity (i.e. below 45%), and this declined to extremely low levels in combinations of two and three test systems. The experts highlighted the impact of the risk of detecting false positives with the in vitro genotoxicity assays, using the example of a review of hair dyes from Speit (2009). In this review, the sole use of in vitro tests may have resulted in a number of false positives and therefore in the withdrawal of these products from the market. The application of the over-protective criteria may not be a limitation, however. Indeed, protecting the public is far more important than producing new hair dyes. In the absence of human data, it is indeed possible that these false positives are not "false" at all. It should also be noted that a significant proportion of the hair dyes were deemed safe in this assessment.
Nonetheless, the report authors did not appreciate the impact of a new strategy to reduce the percentage of false positives in in vitro genotoxicity tests, thus increasing test predictivity, with respect to the need for in vivo genotoxicity/carcinogenicity testing (Fowler et al., in press; Kirkland and Fowler, 2010; Parry et al., 2010).
No data on the validity and predictivity of in vivo tests are given, and therefore a neutral evaluation cannot be conducted. According to Ennever et al. (1987), the sensitivity of animal bioassays is very high (all definite human carcinogens adequately tested were positive). The specificity is low, however (of 20 probable non-carcinogens tested for rodent carcinogenicity in animal bioassays, 19 were positive and only one was negative). Little attempt has been made to validate the lifetime rodent bioassay against human carcinogenicity (Ennever and Lave, 2003). A survey of the US Environmental Protection Agency database to assess the human utility of animal carcinogenicity data showed the animal data were predictive for 42% of chemicals. For the 128 chemicals with human or animal data assessed, however, human carcinogenicity classifications were similar only for those 17 possessing significant human data. The authors concluded that the problem with animal carcinogenicity tests is not their lack of sensitivity for human carcinogens, but rather their lack of human specificity.
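The sensitivity and specificity figures quoted above follow from simple counts. As a minimal illustrative sketch (the function and variable names are ours and do not come from any of the cited sources), the specificity implied by 19 positive results and one negative result among 20 probable non-carcinogens can be computed as follows:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true carcinogens that the test calls positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-carcinogens that the test calls negative."""
    return true_neg / (true_neg + false_pos)


# Counts taken from the text: 20 probable non-carcinogens were tested,
# 19 were positive (false positives) and 1 was negative (true negative).
bioassay_specificity = specificity(true_neg=1, false_pos=19)
print(f"Rodent bioassay specificity: {bioassay_specificity:.0%}")
```

On these counts the specificity is 1 in 20, which is the sense in which it is described as low.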
A retrospective analysis conducted using the National Toxicology Program database on sixteen chemicals that may lead to liver, lung, or kidney tumors in two-year rodent cancer bioassays, and for which short-term data also were available, showed that cancer often is secondary to a biological precursor effect, that the mode of action sometimes is not relevant to humans, and that key events leading to cancer in rodents from nongenotoxic agents usually occur well before tumorigenesis and at the same or lower doses than those producing tumors (Boobis et al., 2009). The authors concluded that the two-year bioassay in rats and mice is, at best, only an indicator of potential hazard. Similar conclusions were reached by Ward (2007), who observed that rodents do not commonly develop the spontaneous tumors most prevalent in humans, including those of the colon and prostate. This is due, in part, to differences in genetics, diet, specific natural chemical exposure, and infectious agents.
In addition to the genotoxicity assays that already have been validated and received regulatory approval, the cell transforma-

so we referred them to the repeated dose chapter for evidence of more extensively developed models. The positive and thorough assessment of the potential for replacing the toxicokinetics endpoint was not reflected, however, in the timelines given by the experts, which seemed overly conservative given the description of the status of these methods in the text. In addition, the report did not give enough emphasis to the fact that the updated OECD Test Guideline 417 on toxicokinetics already foresees the use of in vitro studies with microsomal fractions to address metabolism or the potential for induction of biotransformation, the use of in vitro dermal absorption studies to characterize absorption, and the use of toxicokinetic modelling for the prediction of systemic exposure and internal tissue dose.
The experts explain that, in fact, toxicokinetics is rarely a requirement under any safety legislation but is considered a

Criticisms of the experts' draft report
Overall, we considered this chapter the most comprehensive and balanced, with the experts approaching the problem with the assumption that animal testing would be banned in 2013, their so-called "2013 non-animal approach scenario." This enabled the experts to be more creative in their analysis of how toxicokinetics could be studied, given this scenario. The experts concluded that the approach to toxicokinetics from an in vitro or in silico basis was "well understood" and that "a whole array of in vitro and in silico methods at various levels of development is available for most of the steps and mechanisms that govern the toxicokinetics of cosmetic substances." They expressed concern that renal models are less well developed,

Alternative Evidence of validity Status
Step

-Low exposure substance, no high potency carcinogen (aflatoxin-like, azoxy and N-nitroso compounds)
No testing needed if human exposure below 1.5 µg/day for chemicals with no structural alerts for genotoxicity and 0.15 µg/day for chemicals with structural alerts for genotoxicity.

TTC
Evidence of validity: Values derived from the Carcinogen Potency Database (CPDB), including data on more than 700 chemical carcinogens (Kroes et al., 2004). Proposed use with genotoxic impurities in drugs (Bercu et al., 2010).
Status: SCCS on-going evaluation for cosmetics.

In vitro gene mutation assay in mammalian cells (MLA)
Evidence of validity: 90% of 553 rodent carcinogens detected when combined with MNT and Ames test (Kirkland et al., 2005).
Status: Accepted for regulatory purposes (OECD TG 476, 1997).

In vitro Chromosome Aberration assay
Evidence of validity: 85% of 553 rodent carcinogens detected when combined with Ames test and MLA (Kirkland et al., 2005).
Status: Accepted for regulatory purposes (OECD TG 473, 1997).

Cell Transformation Assays
Assays established since late 1960s.
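The exposure-based waiver in the first step of the table above can be read as a simple decision rule. The sketch below is only an illustration under our own assumptions: the function and parameter names are hypothetical, exposure is taken to be a single daily estimate in µg/day, and substances in the excluded high potency classes are never waived.

```python
# Thresholds quoted in the table (µg/day).
TTC_NO_ALERT_UG_PER_DAY = 1.5    # no structural alerts for genotoxicity
TTC_WITH_ALERT_UG_PER_DAY = 0.15  # structural alerts for genotoxicity


def ttc_waiver_applies(exposure_ug_per_day: float,
                       has_genotoxicity_alert: bool,
                       is_high_potency_class: bool = False) -> bool:
    """Return True if exposure is below the applicable TTC value.

    High potency carcinogen classes (aflatoxin-like, azoxy and
    N-nitroso compounds) are excluded from the TTC approach, so the
    function never grants a waiver for them (our reading of the table).
    """
    if is_high_potency_class:
        return False
    limit = (TTC_WITH_ALERT_UG_PER_DAY if has_genotoxicity_alert
             else TTC_NO_ALERT_UG_PER_DAY)
    return exposure_ug_per_day < limit


# Hypothetical exposures, for illustration only.
print(ttc_waiver_applies(1.0, has_genotoxicity_alert=False))
print(ttc_waiver_applies(1.0, has_genotoxicity_alert=True))
```

A substance that fails this check is not thereby judged unsafe; it simply moves on to the in vitro assays listed in the remaining rows.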
to its forward-looking nature, the report did not evaluate current in vivo methods, which, in our opinion, are limited due to the significant differences in metabolism and physiology between animals and humans. For example, before in vitro Absorption, Distribution, Metabolism and Excretion (ADME) studies on human cell models were routinely used by the pharmaceutical industry, the failure rate of drugs in clinical trials due to poor prediction of ADME was 40% (Kola and Landis, 2004); now it is only 10% (McKim, 2010).

Our proposed approach
As described by the experts, relevant stages of toxicokinetics can be modeled using mathematical Physiologically Based ToxicoKinetic (PBTK) models. These models consist of a set of physiological and chemical parameters that can predict the distribution and excretion of substances through the human body following initial input of information on absorption and metabolism. The pharmaceutical industry has used these models with growing sophistication since the 1970s (Andersen, 2003). The skin is the main route for the absorption of cosmetics and can be modeled using the regulatory approved in vitro skin method. Metabolism can be predicted through the use of high-

"nice to have" endpoint, a matter over which they expressed regret. The point, nonetheless, is that toxicokinetic information, although useful, is not specified in the Cosmetic Directive and is not considered a "core requirement" by the SCCS (SCCP, 2006). Not surprisingly, therefore, the survey of dossiers submitted to the SCCS between 2000 and 2006 found that fewer than 50% of dossiers included in vivo toxicokinetic data (Pauwels et al., 2009). It appears that the use of in vitro methods (skin absorption) and in silico physiologically-based pharmacokinetic (PBPK) models is already commonplace, and so it was disappointing not to see examples of current (as opposed to future) company strategies for toxicokinetics, as attempted in the repeated dose chapter.
We also found a few areas where additional information on the utility and validity of models could be found. For example, according to published evidence, the suitability of the artificial PAMPA-skin for skin absorption (Ottaviani et al., 2007) and of the in vitro HaCaT cell model (Goebel et al., 2009) is already partially established for cosmetics; thus the estimated timeline to enter pre-validation should have been sooner than 2013. The same timeline update was suggested for metabolic activation, for which the Ames test is available. Finally, perhaps due in part

Alternative Evidence of validity Status
Step 1 - Determine likely absorption: conduct a skin absorption assay and use together with physicochemical properties to determine likely systemic absorption through the skin or other routes.

In vitro dermal absorption test
Evidence of validity: In vitro-in vivo correlation evidenced since early 1980s (Bronaugh et al., 1982). OECD experts agreed in 1999 that there was sufficient data to support the Test Guideline (OECD, 2004).
Status: Basic criteria for the use for cosmetics first published by SCCNFP (now SCCS) in 1999 (SCCNFP, 1999).

Step 2 -Determine Distribution, Metabolism and Excretion
If absorption possible, use input from absorption assay to model using a combination of PBPK models and in vitro assays.

PBTK computer models
Evidence of validity: 80% correct in vivo predictions of distribution for 123 drugs within 2-fold error (Poulin and Theil, 2002).
Status: Use proposed by EFSA for pesticide residues in food (EFSA, 2007).

In vitro assays on hepatocytes (liver cells)
Evidence of validity: Review of studies concluded that hepatic clearance could be predicted using human liver microsomes (Chiba et al., 2009). Retrospective analysis on 50 drugs found human liver cells are as predictive as animal tests (Hosea et al., 2009). In vitro tests with PBPK modelling (SCHH-PBPK) gave better prediction accuracy for humans compared to in vivo rat and dog (Yamazaki et al., 2011).
Status: Being pre-validated by ECVAM in 2011. Included in regulatory guideline (OECD TG 417, 2010).

now been improved to increase both its applicability (Dartel et al., 2010) and its speed of conduct (Peters et al., 2008a,b) and to account for metabolism (Hettwer et al., 2010). This was not considered by the experts. No references to industry use of the EST were given. For example, Pfizer uses the EST to make compound development decisions, and Johnson and Johnson also has developed an automated system for HTP of the EST (Peters et al., 2008a,b). Most recently, West et al. (2010) found that the model was able to correctly predict the teratogenicity of seven out of eight blinded drug treatments, with a specificity of 100%, sensitivity of 80% and overall accuracy of 88%.
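The three summary figures reported by West et al. (2010) can be cross-checked against one another. The sketch below assumes that the eight blinded treatments are scored as eight independent test items (our assumption, as are all names in the code) and enumerates every confusion table compatible with seven correct calls, 80% sensitivity, and 100% specificity.

```python
def consistent_confusion_tables(n_total, n_correct, sens_pct, spec_pct, tol=0.5):
    """List (TP, FN, TN, FP) tuples that match the reported summary figures.

    Percentages are matched to within `tol` percentage points to allow
    for rounding in the reported values.
    """
    matches = []
    for tp in range(n_total + 1):
        for fn in range(n_total + 1 - tp):
            for tn in range(n_total + 1 - tp - fn):
                fp = n_total - tp - fn - tn
                if tp + tn != n_correct:
                    continue
                if tp + fn == 0 or tn + fp == 0:
                    continue  # sensitivity or specificity undefined
                sens = 100 * tp / (tp + fn)
                spec = 100 * tn / (tn + fp)
                if abs(sens - sens_pct) <= tol and abs(spec - spec_pct) <= tol:
                    matches.append((tp, fn, tn, fp))
    return matches


tables = consistent_confusion_tables(n_total=8, n_correct=7,
                                     sens_pct=80, spec_pct=100)
print(tables)
```

Under these assumptions only one table fits: four true positives, one false negative, three true negatives, and no false positives, which gives an accuracy of 7/8 (87.5%, reported as 88%).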
The experts' coverage of more complex assays, such as the endocrine transcriptional assays, was confused and incomplete. Some methods were featured in the table of methods but were not referred to within the text, and some methods that were referred to in the text did not appear in the table. This is of particular concern when some of these methods are considered validated or in the process of being validated, for example the estrogen receptor transcriptional assay (LUMICell-ER) and the Stably Transfected Transcriptional Activation assay (STTA), now OECD TG 455.
In conclusion, given that several receptor binding assays are in draft form or validated at OECD, there are some valid QSAR models, and several in vitro assays have been validated by ECVAM, it would have been appropriate to present a draft testing strategy similar to the one we offer in Table 5. Indeed, this is what the ReProTect study has just done in its "feasibility study" (Schenk et al., 2010). We think the experts should have considered the outcomes of the ReProTect feasibility study and used these as a basis for discussion of next steps. We are not alone in this opinion; Dr Spielmann has made the same point (Spielmann, 2010). At the very least, the expert report could have illustrated the key stages of the reproductive cycle that should be covered and suggested where the gaps in available methods are, whether this is in applicability domain, predictability, or coverage of predictive endpoints. It should not be assumed that all aspects of the reproductive cycle are equally crucial to represent with alternative methods. For example, Bremer (2008) stated that "I must insist that it is embryonic development, rather than fetal development, which is the principle cause for concern, since organogenesis is the most sensitive phase in the developing child."
Our proposed approach
Several methods, including whole embryo cultures, stem cell tests, and receptor binding assays, have been developed and are either validated according to ECVAM principles and/or are already OECD guidelines. We argue that, individually, some of these methods already show sufficient predictability of human effects across a range of test chemicals (see Table 5). It may not be necessary to cover all stages of the reproductive cycle, as some are more sensitive to chemicals than others. For example, the EST covers the development of the embryo, which is a very sensitive period.
Thus a combination of these methods, covering the most sensitive endpoints in the reproductive cycle, may already be able to predict reproductive toxicity to an acceptable level of certainty. The EU ReProTect project recently concluded

throughput assays on cultured human hepatocytes, which are commonly used in most pharmaceutical companies. A proposed approach, summarized in Table 4, would not provide a complete ADME analysis but may provide adequate data to help make safety decisions.

Reproductive toxicity (chapter 5)
Criticisms of the experts' draft report
This chapter was one of the weaker chapters in that it did not justify many of its conclusions on the utility of the methods and, inexplicably, it omitted from its table of methods in vitro and QSAR models that had been discussed. While the chapter was clear in its assessment of the need for reproductive toxicity testing for cosmetics (tests are only required if "considerable oral intake or dermal absorption is expected" (SCCP, 2006)), it was overly negative regarding the value of existing in vitro methods and unimaginative in its approach to the 2013 deadline. The experts in this chapter failed to consider the TTC approach and did not quantitatively assess the validity of in vivo, in vitro, or in silico methods. Much of the chapter was devoted to describing qualitatively the various in vivo methods available, ignoring the fact that for cosmetics ingredients only the developmental toxicity guideline (OECD TG 414) tends to be used (Rogiers and Pauwels, 2008).
No evidence of the validity of the in vivo test methods was given. This is a concern because there is evidence that the predictive power of the prenatal developmental toxicity test is rather poor. For example, Hurtt et al. (2003) found that the positive predictivity of one species to teratogenic effects in rat, mouse, or rabbit was around 60% for 105 veterinary pharmaceuticals. Bailey et al. (2005) found that the rat was positively predictive of 35 known human teratogens in 61% of cases and the rabbit in 41%.
Failure to assess the in vivo methods is also a concern because it does not allow a fair comparison with potential in vitro or in silico methods, which may have similar or better predictivity. No data on the predictive capacity of the Whole Embryo Culture (WEC) test (Genschow et al., 2002), the micromass test (MM) (Spielmann et al., 2004), or the zebrafish embryo test (Selderslaghs et al., 2009) were given, even though they are easily available in the references given here. Perhaps more crucially, no information is given on the validation outcomes of the embryonic stem cell test (EST), a more commonly used, more refined test. Indeed, the test was omitted from the table of methods used by all chapters to provide an overview of all methods. The ECVAM validation in 2002 stated that "the correlation between the in vitro data and in vivo data was good (accuracy 78%) and the test proved applicable to testing a diverse group of chemicals of different embryotoxic potentials" (ESAC, 2002). The expert report stated that the EST method is limited but did not provide details relating to this comment nor the counterpoint to this criticism. For example, both Spielmann (2009) and Combes (2009) criticize the ReProTect study that appeared to demonstrate a weakness of the test because it changed the classification from that used in the original ECVAM validation study. The EST has

Tab. 5: An alternative approach for reproductive toxicity

TTC
Evidence of validity: Values derived from fertility and developmental toxicity data (oral and inhalation exposure) on 91 chemicals (Bernauer et al., 2008).
Status: SCCS on-going evaluation for cosmetics.
Step 3 - Higher exposure substance, non-embryotoxic: if human exposure exceeds TTC levels, and the substance is non-embryotoxic, perform a combination of fertility and endocrine in vitro tests to determine likely effects on fertility.

Computer Assisted Sperm
Test evaluated by two different laboratories in more Pre-validated in ReProTect project.
Testicular fragment culture
Evidence of validity: 82% expected results on 11 chemicals (Freyberger et al., 2010b).
Status: Needs to be taken forward for prevalidation.

Sertoli cell test
Evidence of validity: "Good" results in two laboratories for seven chemicals (AXLR8, 2010).
Status: Needs to be taken forward for prevalidation.

Endocrine Effects
Estrogen receptor alpha binding assay
Evidence of validity: Bayer Schering study showed it reliably ranked compounds with strong, weak, and no effect with high accuracy on 12 chemicals (Freyberger et al., 2010a).
Status: Part of OECD/ReProTect project, expected to go to ECVAM validation.

Estrogen Receptor (ER) -
Bayer Schering pre-validation study showed good ECVAM pre validation report due 2011, Transcriptional Activation accuracy on 16 chemicals and good inter laboratory expected to go to ECVAM validation.

AR CALUX reporter gene assay
Evidence of validity: Inter-laboratory study on 64 chemicals showed 74% agreement (Sonneveld et al., 2006). Pre-validation study showed excellent agreement for 14 out of 16 chemicals (Van der Burg et al., 2010). Up to 85% agreement with rabbit test for 50 chemicals (Sonneveld et al., 2011).
Status: Pre-validated in ReProTect (AXLR8), expected to go to ECVAM validation.

approaches to waive animal testing, such as exposure-based techniques. Not all reports covered the possibility of waiving animal tests using the TTC approach, which is an approach used for substances that are applied in very small quantities. Coverage of other approaches such as QSARs and ITS also varied between the chapters, with the chapter on repeated dose covering these but not the chapter on reproductive toxicity. In contrast to the other chapters, the chapter on toxicokinetics started from the basis that animal testing was no longer permissible, and a more forward-looking assessment was the result. Similarly, the chapter on repeated dose included company strategies, which was helpful in identifying what is being used in industry practically as opposed to theoretically. It showed that companies already are applying imaginative approaches to avoid both testing on animals and placing potentially harmful ingredients on the market. Our paper summarizes these comments but also provides the kind of information and approach that we expected to see. The alternative approaches for each endpoint presented here are not meant to be complete answers to the problem but rather to provide a good basis for discussion about the immediate utility of various methods, if not to provide reassurance that some methods soon will be a complete answer. It would have been useful if the experts had started from this point and then, as was requested of them, highlighted any areas that are not yet adequate and provided timelines for when they might be.
In most cases, the experts failed completely to provide reasonable timescales for a new deadline for testing or a strategy for the steps that need to be made to ensure that alternatives are available. Some reports failed to suggest possible deadlines at all.
Finally, the possibility of extending the 2013 deadline in the text of the Cosmetic Directive only applies to Repeated Dose, Toxicokinetics, and Reproductive Toxicity. There has been a subsequent assumption over the years on the part of the Commission that the term "repeated dose" also includes skin sensitization and carcinogenicity endpoints, which we believe is incorrect. The evidence that replacements are nearly validated for these endpoints should help convince legislators that any extension to the 2013 deadline should not be applied to these tests.

that a battery of cell tests "allowed a robust prediction of adverse effects on fertility and embryonic development" (Schenk et al., 2010), with a combined accuracy of between 70% and 100% for 10 test chemicals (AXLR8, 2010). The use of these tests should be viewed in the context of the poor predictivity of the animal test and the fact that these tests are not always considered mandatory, due in part to the low exposure of humans to individual cosmetic ingredients. Those companies that voluntarily undertake reproductive toxicity tests usually only do the developmental toxicity test (Rogiers and Pauwels, 2008), which the EST may effectively replace. In addition, the threshold of toxicological concern (TTC) approach has demonstrated feasibility for reproduction endpoints for chemicals generally (Bernauer et al., 2008) and may also be used in certain cases when exposure is low.

Conclusion
In general, most chapters took the approach that the animal test has to be mimicked in full, a common, yet arguably incorrect, assumption called the "high fidelity fallacy" (Russell and Burch, 1959). Not surprisingly, the experts believed it would take a very long time to mimic the animal test completely. We disagree that completeness is more important than or as important as predictivity, and we would have liked to see a thorough assessment of this. While we were pleased to see a consistent approach to the presentation of all methods through the use of a table, this did not include a quantitative assessment of their suitability and therefore made it easier for some methods to be dismissed without apparent evaluation of the data on their predictive capacity. Examples of highly predictive tests that were not considered effective replacements because of this included the embryonic stem cell test for developmental reproductive toxicity, the peptide reactivity tests for skin sensitization, and the cell transformation assays for carcinogenicity. The reports also were very inconsistent in their approach to the problem, their evaluation of the need for animal testing, the reliability of animal tests, the status of alternatives, and other

Alternative Evidence of validity Status
Estrogen receptor All 28 estrogen disruptors were detected ICCVAM validation report expected 2011. transcriptional assay, (Gordon and Clark, 2005).

H295R Steroidogenesis assays based on a human cell line
Evidence of validity: 78% accuracy for testosterone effect on 18 chemicals, 88% for estradiol effect on 16 chemicals (OECD, 2009). "Overall, these results indicate that…the H295R would always flag a chemical as a potential disruptor of steroidogenic processes or a reproductive toxicant" (OECD, 2009).
Status: Validated by OECD/EPA in 2009. Draft OECD test guideline being discussed.
In conclusion, the reports were not a complete evaluation of the status of alternative methods, as they were not detailed enough and were inconsistent in their approaches. The fundamental problem, albeit a commonly held one, is an assumption that all aspects of the in vivo approach would need to be modeled in order to provide complete replacements for these endpoints. This inappropriate assumption has naturally led the experts to an "easy" conclusion, i.e. that the 2013 deadline will not be met. This has been a missed opportunity to review the status of alternatives thoroughly, discuss the genuine obstacles objectively, and provide a workable framework for replacement. Other experts have criticized the report along similar lines (see Balls and Clothier, 2010; Spielmann, 2010). As they did, we recommend that the Commission not pay much heed to this report unless or until substantial amendments are made in the final version. We look forward to reading the final report.