Opinion Versus Evidence for the Need to Move Away from Animal Testing



Introduction
For the 10th anniversary of Food for Thought … in ALTEX, it seemed appropriate to summarize what we have learned on this journey with respect to the core subject of this journal: the need for alternatives to animal experimentation. The series has mostly focused on toxicology, but here the aspects that apply also to drug development and basic research shall be considered.
Sure, we need animal models - when we want to study animals. For example, we have to test drugs for animals in animals. However, when studying human physiology, pharmacology and ...

Thomas Hartung
Chair for Evidence-based Toxicology world-wide, endowed by the Doerenkamp-Zbinden Foundation.
These activities aim to bring Evidence-based Medicine to toxicology, i.e., the systematic, objective, and transparent assessment of test methods and decision-making based on test results. This shall limit bias and prejudice and identify the limits of our knowledge - it is thus exactly the opposite of the shortcuts opinion enables.
Opinion is defined by the Oxford dictionary as "A view or judgement formed about something, not necessarily based on fact or knowledge". Hippocrates (ca. 460-375 BCE) is quoted as saying "There are in fact two things, science and opinion; the former begets knowledge, the latter ignorance." But can we actually avoid opinion in science? The Roman emperor Marcus Aurelius Antoninus (121-180 CE) stated "Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth." I strongly believe that opinion not only cannot be avoided, but is in fact highly valuable as long as we clearly distinguish it from factual evidence and make clear where evidence ends and where opinion starts. First, opinion helps to fill the gaps for which we have no evidence yet. Expert advice is better than nothing, much better in fact. Second, it is much more entertaining and inspiring: You cannot argue facts, but a good hypothesis - true or not - can spark ideas, controversy, etc. Already as a student, I largely passed on all lectures that were only conveying textbook knowledge (I could learn this for the exams from the textbooks without my faulty notes), but savored those which were spiced with opinion. So, this has been my goal in my talks, lectures and some of my articles.

Animal tests are costly and resource intensive
It is difficult to apply economic considerations to all animal experiments in basic research and drug development, as we did for safety testing (Hartung, 2009, 2010; Bottini et al., 2007): approaches are so diverse, especially between drug industry and academia, that costs and benefits cannot be contrasted easily.
Toxicological studies become resource intensive for three reasons: (1) they are typically done under Good Laboratory Practice (GLP) quality standards, (2) they treat animals for long periods of time and (3) they assess many endpoints to gain maximum information and avoid missing any harmful effect.
All of this is avoided in other types of research to conserve financial resources, but also because it demands large quantities of material (many kg of test substance) and leads to multiple-testing problems, which decrease the statistical power. In a very simplified view, the economic efficiency of animal tests is determined by whether new, important research is produced or a new drug comes to the market. For both, the impact of a single experiment cannot be judged. This often has more to do with perception than with objective impact. Some people in academia appear to believe that no new line in a textbook can be produced without a new knock-out mouse. We will never know how many wrong decisions are taken in drug development because of misleading animal tests. The performance figures of the few tests analyzed and the drain of the drug pipeline would suggest a substantial number of such mistakes.

Evidence vs. opinion as to economic considerations
While the costs and duration of toxicological studies are clearly prohibitive for satisfying societal safety needs, e.g., the often-quoted example of cancer bioassays at $1 million and four years per chemical, this argument is difficult to make for research outside of toxicology.

Ethics - where there is an alternative, we must use it instead of harming animals!
Ethical aspects can be left aside here - they should be a no-brainer: Not only is it criminal, but no sane person would make animals suffer if there is no need to do so, i.e., if an alternative is practically available.

Evidence vs. opinion as to ethical considerations
No evidence is needed if alternatives are available. But whether they are available depends, outside of toxicology where we have formal validation and regulatory acceptance, largely on opinion. It is not realistic to formally validate alternatives for the majority of models in basic research and drug discovery: there are too many models and model variants, and the methods used change too quickly. It is thus critical to shape opinion by informing, teaching the objective assessment of their value, and creating doubt in current practices.

Animal experiments are not sufficiently reproducible
They are at least not reproducible enough to work with the group sizes that are typically used. Notably, what is meant by "reproducibility" needs some sharpening here, as it means different things in different disciplines and areas (Goodman et al., 2016). I will choose examples from toxicology, my own "turf", but disease models have been systematically reviewed and summarized before (Hartung, 2013), finding no striking differences.
Arguably, toxicology is an area where we can expect the best reproducibility: Protocols have been standardized over decades into international guidance, much work is done under GLP quality assurance, we use high ("maximum tolerated") doses of substances, and, unlike in pharmacology, we do not have to induce artificial diseases in toxicology. We also pay incredible fees to have the experiments performed by trained professionals: A cancer study in one species for one substance costs $1 million (Basketter et al., 2012), an inhalation study $2.5 million (Hartung, 2016); a developmental neurotoxicity study costs $1.4 million (Smirnova et al., 2014). These are budgets one can only dream of in academia, where our youngest students do most of the work while "learning on the job", though this does not prohibit us from publishing their work (Hartung, 2013).

As first checks suggest (Baker et al., 2014), there has been no real improvement in reporting. Notably, these findings apply to the scientific literature, not to the guideline studies used to estimate reproducibility in a type of "best-case scenario" above. A big problem is the generally poor statistics used in publications (Altman, 1998), especially when addressing many effects in the animal at the same time (a chronic toxicity test has 40, a cancer bioassay 60, and a reproductive toxicity study 80 endpoints, without any corrections for multiple testing if using statistics at all). A prominent call for improving reporting of preclinical results was published by Landis et al. (2012).

Evidence vs. opinion as to reporting quality
There is clear evidence that reporting standards for animal experiments are not adequate. This does not mean that they are any better for in vitro work…

Study design - animal experiments are statistically underpowered, which is compensated by so much standardization that they no longer reflect even their own species
Standardization of animals reduces natural variability and, thus, dramatically increases the probability of significant findings. We often use inbred strains (genetically "identical twins"), almost always of the same age and gender; in best cases, we randomize for weight differences, etc. We also keep the animals free of any diseases (as "specified pathogen-free") and standardize cages, temperature, and feed. All of this is helpful to improve reproducibility, but our results will also only reflect this exact condition.
The problem that standardization instead impairs reproducibility has been recently discussed (Voelkl and Würbel, 2016). There is ample literature on how these factors impact on results, e.g., strain (Anon, 2009), genetic drift (Papaioannou and Festing, 1980), gender (Clayton and Collins, 2014), cages (Castelhano-Carlos and Baumans, 2009), lack of enrichment of the environment (Wolfer et al., 2004; Würbel, 2007), feed, temperature, diurnal rhythm, time of the year (Kiank et al., 2007), etc. Nevalainen (2014) summarized some of the influential factors, including also seasonal cycle, reproductive cycle, weekend-working day cycle, cage change and room sanitation cycle, diurnal cycle, in-house transport, caging, temperature, humidity, illumination, acoustic environment, odors, cage material, bedding, complexity items, feeding, kinship and humans. They concluded, "Laboratory animal husbandry issues are an integral but underappreciated part of experimental design, which if ignored can cause major interference with the results".
In no case was this comprehensively assessed for any given animal test. The reported impacts are anecdotal; it is difficult to say how they jointly affect results and how our often-arbitrary choices or lack of control of a parameter distort them. However, it is clear that hardly any experiment shows a general result for a given species.
The cancer bioassay is a good test case for a reproducibility assessment of animal tests: More than 3,500 studies have been amassed - at today's cost that is $3.5 billion spent. 13% of studies give equivocal results (Seidle, 2006) and the reproducibility was 57% for 121 substances tested repeatedly (Gottman et al., 2001). The OECD guidelines do not make randomization and blinding mandatory, and the guideline statistics do not control for multiple testing, despite the fact that about 60 endpoints are assessed. The cancer bioassay might be a difficult case as some colleagues argue, but when looking at the non-cancer endpoints for 37 substances, very little of the earlier chronic studies was reproduced and consistency between genders and rodent species was low (Wang and Gray, 2015).
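To make the multiple-testing concern concrete, here is a minimal sketch in Python. It assumes that the roughly 60 endpoints are statistically independent and that each is tested at a significance level of 0.05; both are simplifying assumptions for illustration and are not taken from the guidelines. The function names are hypothetical.

```python
def familywise_error_rate(n_endpoints: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_endpoints


def bonferroni_threshold(n_endpoints: int, alpha: float = 0.05) -> float:
    """Per-endpoint significance level that keeps the overall rate at or below alpha."""
    return alpha / n_endpoints


# About 60 endpoints, as stated for the cancer bioassay; alpha = 0.05 is assumed.
n = 60
print(f"uncorrected family-wise error rate: {familywise_error_rate(n):.3f}")
print(f"Bonferroni per-endpoint threshold:  {bonferroni_threshold(n):.5f}")
```

Under these assumptions, the chance of at least one spurious "significant" finding in a study without any true effect is about 95%. Real endpoints are correlated, so the true figure would differ; the sketch only shows the direction and rough size of the problem.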
What about simpler and shorter animal studies? Severe eye irritation is 70% reproducible (Luechtefeld et al., 2016a). Even validated animal tests do not perform much better: The local lymph node assay for skin allergy is 89% reproducible (Luechtefeld et al., 2016b), and the uterotrophic test for estrogenic endocrine disruption has 26% controversial data if repeated (Browne et al., 2015).
These are only assessments of the reproducibility under the optimal conditions of regulatory guideline studies - this does not say that the results are meaningful for humans. Hallmark papers with respect to non-reproducibility of academic research (Begley and Ellis, 2012; Prinz et al., 2011) have alarmed the scientific community (Macleod, 2011; McGonigle and Ruggeri, 2014; Jarvis and Williams, 2016).

Evidence vs. opinion as to reproducibility
There is increasing evidence that we have a reproducibility problem to which animal experimentation is contributing. Still, more systematic analyses are needed to form a point of reference.

Animal experiments are not reported well enough
Efforts to develop guidance on how to report animal studies led to the ARRIVE guidelines (Kilkenny et al., 2010). So, we know what should be reported when writing a scientific paper. When applying this standard and comparing with the reality of 271 randomly picked studies (Kilkenny et al., 2009), the results are more than disappointing: "Only 59% of the studies stated the hypothesis or objective of the study and the number and characteristics of the animals used. … Most of the papers surveyed did not use randomisation (87%) or blinding (86%), to reduce bias in animal selection and outcome assessment. Only 70% of the publications that used statistical methods described their methods and presented the results with a measure of error or variability." More than 300 journals have adopted the ARRIVE guidance, but this seems to be mainly lip-service.

With respect to toxic effects, we typically study the acute and local effects of high doses in animal experiments, which are relevant, if at all, in workplace situations. However, for general human health, we should be concerned about low and chronic exposures. We are exposed to mixtures of chemicals in and from the different products. Differences in the kinetics and metabolism of substances add to the problem. The human organism often varies dramatically from the animal with respect to uptake, distribution and excretion of substances, and forms very different metabolites of the same substance.

Evidence vs. opinion as to not reflected human diversity
There is no doubt about this, but also not any answer showing how to tackle the problem. Panels of human cells representing diverse individuals would work only in a few cases.
What can we improve in our animal experiments?
Table 1 shows a personal scoring for the available evidence:

Evidence vs. opinion as to study designs
There is clear qualitative evidence that many impacting factors are either not controlled or standardized to an extent that results are no longer generally applicable. There is no quantitative evidence for most of these factors though. Opportunities to remedy these problems are limited by feasible group sizes, as most designs are already underpowered.

Animal experiments do not even predict other animal species
I am often quoted for the rather simple statement "Humans are not 70 kg rats!" (Hartung, 2009). But rats are also not 300 g mice! The difference here is that we can compare because some highly standardized (toxicological) tests are being done on more than one species (Leist and Hartung, 2013). The results are discouraging: mice and rats predict each other for carcinogenicity of chemicals by 57% (Gray et al., 1995), and this value drops if we also look for prediction of the target organ that is affected (Gold et al., 1991). Rats and rabbits (as well as other species) predict reproductive toxicity of each other by 60% (Bailey et al., 2005). Guinea pigs and mice predict skin sensitization of each other in 77% of cases (Luechtefeld et al., 2016b). Mouse and rat have little prediction for each other's chronic toxicities (Wang and Gray, 2015).
There is no reason to assume that any species predicts effects in humans any better (Perlman, 2016) than it predicts effects in another animal species. Hardly any species comparisons have been done for basic research and drug discovery. However, often even differences between mouse strains are reported.

Evidence vs. opinion as to inter-species predictivity
There is clear evidence for tremendous species differences from toxicology, but this is limited for other areas of research. There is no reason to assume that toxicology has more inter-species variances; on the contrary, here substance effects are studied at high doses, most substances act in a manner that is not receptor-mediated, and, unlike in pharmacology, there is no additional complication of a disease model, in which the substance is tested for modulatory effects.

Animal experiments do not reflect human diversity, exposure, and treatment
The lack of natural diversity in our animal experiments was already addressed. Humans are different from inbred mice in many aspects: our weights, our age, our lifestyle, our genetics, our history of diseases cover broad ranges. This all makes it very difficult to predict substance effects, even more if one is trying to treat diseases that are at different stages in combination with different comorbidities and other parallel treatments. This has nothing to do with the monotreatments in standardized disease models. For a list of differences, see Hartung (2013).

Table 1 (fragment)
Inter-species predictivity: ++, ++
Human diversity not reflected: +++, +++

Incredibly high rates of misidentified cells, mycoplasma infections, and genetic aberration in culture challenge this part of research no less than the animal tests discussed here. They will be part of a roadmap to a more comprehensive coverage of human hazards by new approach methods (Basketter et al., 2012; Leist et al., 2014).

Conclusions
This discussion of the shortcomings, first and foremost of animal tests, is not a call to abandon them right now. Information that is often not right might still be better than no information at all. It means, however, that in light of these limitations we need really good justification to harm an animal. Only looking for and openly discussing the limitations of an individual animal test will enable us to move forward, and often this means away from the animal model. In some cases, we might not yet have an alternative, but it is important to identify the goal of creating one. The critical step is to understand the strengths and limitations of our models, both in vivo and in vitro. The systematic assessment of study quality (Samuel et al., 2016) is a key step toward such analysis, preferably by systematic reviews (Stephens et al., 2016), as recently proposed, for example, for endocrine disrupting chemicals (Vandenberg et al., 2016). Then we can start combining them to move towards more meaningful results in integrated testing strategies (Rovida et al., 2015).
This all is more easily said than done. Science has too few self-critical and self-controlling mechanisms. Nobody writes more than is absolutely necessary about the weaknesses of the models in scientific papers or grant applications. Those who are more careful and control their models and results are penalized, as they cannot publish as quickly and as much exciting stuff as their colleagues.
But these exciting results are often shaky: the reproducibility of only 10-25% for important scientific papers, as reported by the pharmaceutical industry, is alarming (Begley and Ellis, 2012; Prinz et al., 2011). Science works by forgetting the irreproducible results over time - we stop citing them. However, the growth of the scientific community and the ever-easier access to literature allow these studies to resurface again and again, cited by those who don't know any better. Given the lottery of our peer-review system and the overload of the experts with review duties, and the business models of publishers, which in the end make everything publishable somewhere, we should not be surprised about the increasingly perceived "reproducibility crisis" (Baker, 2016).
The efforts by NIH and others to address this are laudable, but in the end, we need a "scientific enlightenment movement", a type of restart as Life Science 2.0. Clinical research has started this with Evidence-based Medicine. We need something similar in preclinical research. Efforts of systematic reviews of animal studies (Ritskes-Hoitinga et al., 2014) or Evidence-based Toxicology ...

Sure, we can improve many aspects of how we do our animal tests (leaving aside all the aspects of reducing distress and suffering of the animals (Zurlo and Hutchinson, 2014)): We can use more genetically diverse animals in enriched environments, study both genders and several species. Richter et al. (2011) have actually shown that systematic variation of experimental parameters improves reproducibility.
We can analyze the kinetics of substances in different species, including man, and improve our extrapolation to humans (Bale et al., 2014; Tsaioun et al., 2016), especially by integrating information from in vitro epithelial barrier models (Gordon et al., 2015). We can use and properly report the right statistics, which in turn will strongly increase the necessary animal group sizes. However, all this would make our experiments incredibly expensive.
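How quickly group sizes grow can be illustrated with a standard textbook approximation. The sketch below is not a method from the studies cited here: it assumes a two-sided comparison of two group means, a normal approximation, 80% power, and an effect size expressed in units of the standard deviation. The function name and all numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def animals_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate group size for a two-sided, two-group comparison of means.

    effect_size is the expected difference divided by the standard deviation.
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / effect_size)^2.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# Same absolute effect, but twice the variability halves the effect size.
print(animals_per_group(1.0))                    # standardized animals
print(animals_per_group(0.5))                    # more diverse animals
print(animals_per_group(1.0, alpha=0.05 / 60))   # Bonferroni over 60 endpoints
```

In this approximation, doubling the variability roughly quadruples the required group size, and correcting for many endpoints raises it further. The normal approximation slightly underestimates the numbers an exact t-test calculation would give, so these values should be read as lower bounds.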
We can argue that from a given budget it is better to publish fewer but more meaningful results. We can also standardize and validate further animal tests -this would improve the comparability of results and show more clearly the strengths and weaknesses of these models, but again these are lengthy and costly exercises that would likely produce many disappointments about broadly used models.
The traditional way of handling this in toxicology is the safety or assessment factor, i.e., the (no) effect level of a substance is corrected by a factor, typically 10, for possible inter-individual and another factor, typically 10, for inter-species differences. Often additional factors are added if further limitations exist in the data. It is pragmatic to err on the side of safety.
The problem is that such uncertainty factors cannot be modelled in vitro or in silico. They also come on top of additional safety measures, such as choosing the most sensitive species and using high (maximum tolerated) doses. However, neither assessment factors nor high doses help in disease and drug effect models.
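The arithmetic behind the assessment factors described above can be sketched as follows. The two default factors of 10 come from the text; the function name, the example no-effect level of 100 mg/kg/day and the extra factor are hypothetical values chosen only for illustration.

```python
import math


def safe_exposure_level(no_effect_level: float, extra_factors: tuple = ()) -> float:
    """Divide a (no) effect level by the combined assessment factors.

    Defaults: 10 for inter-individual and 10 for inter-species differences.
    extra_factors holds any additional factors applied for limitations in the data.
    """
    factors = (10.0, 10.0) + tuple(extra_factors)
    return no_effect_level / math.prod(factors)


# Hypothetical no-effect level of 100 mg/kg/day from an animal study.
print(safe_exposure_level(100.0))            # two default factors only
print(safe_exposure_level(100.0, (10.0,)))   # one additional factor of 10
```

Each added factor lowers the accepted exposure by another order of magnitude, which is why the approach errs on the side of safety without saying anything about how well the underlying animal result predicts the human response.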

What can we do instead of animal experiments?
First, we should study what we can in humans in order to understand human physiology, disease and treatment. We do not really take enough advantage of the ongoing daily exposure of people, i.e., epidemiology, though advances like biomonitoring, biomarkers, biobanking, and the human exposome (Escher et al., 2017) must be noted. Also, microdosing of substances in humans and more comprehensive assessments when first going into humans represent some, though limited, opportunities (Seymour, 2009).
Human tissues and their reconstruction in vitro (Alépée et al., 2014), including bioprinting and organ-on-chip bioengineering, represent the next line of opportunities (Andersen et al., 2014; Marx et al., 2016). The current paradigm shift toward organotypic cultures with organ architectures and organ functionalities presents avenues to more meaningful models.
These prospects should not blind us to the shortcomings of these models and the challenges ahead (Hartung, 2007, 2013; Pamies and Hartung, 2017). Their quality assurance, e.g., Good Cell Culture Practice and in vitro reporting standards (Coecke et al., 2005; Leist et al., 2010), are only under development.