Global Analysis of Publicly Available Safety Data for 9,801 Substances Registered under REACH from 2008–2014

Summary The European Chemicals Agency (ECHA) warehouses the largest public dataset of in vivo and in vitro toxicity tests. In December 2014 this data was converted into a structured, machine readable and searchable database using natural language processing. It contains data for 9,801 unique substances, 3,609 unique study descriptions and 816,048 study documents. This allows exploring toxicological data on a scale far larger than previously possible. Substance similarity analysis was used to determine clustering of substances for hazards by mapping to PubChem. Similarity was measured using PubChem 2D conformational substructure fingerprints, which were compared via the Tanimoto metric. Following K-Core filtration, the Blondel et al. (2008) module recognition algorithm was used to identify chemical modules showing clusters of substances in use within the chemical universe. The Global Harmonized System of Classification and Labelling provides a valuable information source for hazard analysis. The most prevalent hazards are H317 “May cause an allergic skin reaction” with 20% and H318 “Causes serious eye damage” with 17% positive substances. Such prevalences obtained for all hazards here are key for the design of integrated testing strategies. The data allowed estimation of animal use. The database covers about 20% of substances in the high-throughput biological assay database Tox21 (1,737 substances) and has a 917 substance overlap with the Comparative Toxicogenomics Database (~7% of CTD). The biological data available in these datasets combined with ECHA in vivo endpoints have enormous modeling potential. A case is made that REACH should systematically open regulatory data for research purposes.


Introduction
The European REACH legislation (Regulation (EC) 1907/2006) 1 prescribed the largest collection of chemical toxicity data in history. REACH aims to collect comprehensive safety information for all substances on the European market in volumes of more than 1 ton per year of production or import volume. Basically, it includes three groups of substances, i.e., substances for which so far no registration was necessary on the European level, substances introduced under the Dangerous Substances Directive, since then with somewhat different registration requirements, and all new substances above 1 ton per year since entering into force of the REACH legislation. The legislation is organized by different deadlines, two of which had passed at the time of data analysis. The first required the registration of substances at tonnage levels above 1,000 tons and those with concerns as to carcinogenicity, mutagenicity and reproductive toxicity (CMR) before December 2010 and the second required the registration of substances above 100 tons per year before June 2013; new substances were added to this, but their number is relatively small (Hartung, 2010). For this reason the analysis is clearly biased toward high-production volume substances.
While computational toxicology has recently seen the collection of several large-scale datasets (e.g., US EPA's ToxCast), the data generated and collected for REACH, owing to its legislative nature, is becoming the largest collection of (eco-)toxicology data relating to in vitro and in vivo endpoints. However, the REACH dossiers are currently proprietary and any workflows involving the public summary data in REACH depend on the slow and error-prone process of manual extraction. Dossiers can be viewed on the ECHA website 2; documents are generated by industry via the IUCLID 3 application.
Here we seek to demonstrate the extent and diversity of the REACH dataset, a dataset that far surpasses most existing datasets used for computational toxicology, and show how an open-access REACH program could allow a profound change in computational toxicology. More detailed analyses were performed for ocular, oral and skin endpoints in other publications (Luechtefeld et al., 2016a-c, this issue).

REACH data extraction
Data was downloaded from ECHA using HtmlUnit in an iterative manner in order not to hinder data flow, using an open source Java "Gui-less browser" library (Bowler, 2002). Implementation of ECHA dossier download automation used the functional programming language SCALA (Odersky et al., 2004).
A MongoDB database 4 was generated from REACH data (Chodorow, 2013). Extracted REACH data is stored as a query-able collection of documents in this Mongo database. The database was generated by automated data extraction from ECHA dossier URLs via the SCALA driver ReactiveMongo (Godbillon, 2015). Every document is identified by a unique set of three fields:
• ECNumber: Substance identifier (e.g., "415-890-1")
• type: Study description (e.g., "Exp Key Eye irritation")
• num: disambiguates repeat studies (1, 2, 3,…)
The constructed database, downloaded December 17, 2014, contains 816,048 such documents with 9,801 unique substances (identified by ECNumber) and 3,609 unique study descriptions. Not every substance was associated with information for every study type.
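The three-field identifier can be illustrated with a minimal in-memory sketch. This is not the authors' Scala/ReactiveMongo code; the names `StudyKey` and `index_documents` are hypothetical, and the documents are assumed to be plain dictionaries carrying the three fields named above.

```python
from collections import namedtuple

# Hypothetical composite key mirroring the three identifying fields.
StudyKey = namedtuple("StudyKey", ["ECNumber", "type", "num"])


def index_documents(documents):
    """Index study documents by (ECNumber, type, num), rejecting duplicate keys."""
    index = {}
    for doc in documents:
        key = StudyKey(doc["ECNumber"], doc["type"], doc["num"])
        if key in index:
            raise ValueError("duplicate study key: %r" % (key,))
        index[key] = doc
    return index


docs = [
    {"ECNumber": "415-890-1", "type": "Exp Key Eye irritation", "num": 1},
    {"ECNumber": "415-890-1", "type": "Exp Key Eye irritation", "num": 2},
]
index = index_documents(docs)
```

A repeat study for the same substance and study description differs only in `num`, so both entries above are kept as separate documents.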
While ECHA disseminated data is a highly structured dataset, much of REACH data contains natural language for quantitative and categorical fields such as: number of animals, Klimisch score, dates, GHS hazards, dose data, response data, etc. These fields were mapped to numeric or categorical values via regular expression recognizing number words and numbers.
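A minimal sketch of this kind of mapping is given below. The actual expressions used for the extraction are not reproduced in this article, so the word list `NUMBER_WORDS` and the function `extract_numbers` are illustrative assumptions covering only the words one to ten.

```python
import re

# Hypothetical lookup table; a real extraction would need a longer list.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Match either a run of digits or one of the known number words,
# bounded by word boundaries so that "none" does not match "one".
NUMBER_PATTERN = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)


def extract_numbers(text):
    """Return every number found in text as an integer, in order of appearance."""
    values = []
    for token in NUMBER_PATTERN.findall(text):
        token = token.lower()
        values.append(int(token) if token.isdigit() else NUMBER_WORDS[token])
    return values
```

For example, `extract_numbers("Five males and 5 females")` yields `[5, 5]`, and a Klimisch field such as "2 (reliable with restrictions)" yields `[2]`.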
To better enable categorization of studies used for animal endpoints, we enriched studies by categorizing into four groups (InVitro, InVivo, ReadAcross or QSAR / PCHEM) mainly through analysis of keywords (i.e., "read across" in the methods data likely represents a ReadAcross study). The QSAR / PCHEM category refers to quantitative structure activity relationship model studies and physicochemical property studies. Due to an overlap in the language used by ECHA to describe these studies, QSAR and PCHEM are grouped together.
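The keyword-based enrichment can be sketched as an ordered rule list. The keywords and their order below are assumptions chosen for illustration; only the "read across" example and the four category names come from the text.

```python
# Hypothetical keyword rules, checked in order; the first matching rule wins.
CATEGORY_KEYWORDS = [
    ("ReadAcross", ("read across", "read-across")),
    ("QSAR / PCHEM", ("qsar", "(q)sar", "estimated by calculation")),
    ("InVitro", ("in vitro",)),
    ("InVivo", ("in vivo",)),
]


def categorize_study(methods_text):
    """Return the first category whose keyword occurs in the text, else None."""
    text = methods_text.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return None
```

Checking the read-across rule first reflects the assumption that a study mentioning both "read across" and "in vivo" should be counted as read-across rather than as a new animal test.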
When applicable, guideline identifiers were extracted from study data. Thus all the studies matching a given OECD guideline can be easily queried.
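As a rough illustration, guideline numbers could be pulled from free text with a pattern like the one below. The exact wording ECHA uses for guideline fields varies, so the pattern `GUIDELINE_PATTERN` and the assumption of three-digit guideline numbers are simplifications, not the expressions actually used.

```python
import re

# Assumed phrasings: "OECD Guideline 405", "OECD Test Guideline 429", "OECD TG 401".
GUIDELINE_PATTERN = re.compile(
    r"OECD\s+(?:Test\s+)?(?:Guideline|TG)\s+(\d{3})\b", re.IGNORECASE
)


def extract_oecd_guidelines(text):
    """Return the sorted, de-duplicated OECD guideline numbers mentioned in text."""
    return sorted({int(number) for number in GUIDELINE_PATTERN.findall(text)})
```

Storing the result as an integer list on each study document is one way to make "all studies matching a given OECD guideline" a simple equality query.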
ECHA disseminated data is of a highly nested nature: administrative, reference, results, materials and methods data all have many subfields, and some subfields have their own subfields. The root fields for studies may, but do not necessarily, include:
• Reliability: 1 (reliable without restrictions), 2 (reliable with restrictions), 3 (not reliable), 4 (not assignable) and other (Klimisch et al., 1997).
• Study result type: Study descriptions including: estimated by calculation, experimental results, (Q)SAR, read-across from supporting substance, read-across based on grouping of substances, no data, experimental study planned and

substances into 881 element binary vectors describing the presence or absence of substructures. Similarity between chemical vectors is calculated via Tanimoto distance. Tanimoto distance, the number of shared substructures divided by the total number of substructures, is a number between 1 (perfectly similar) and 0 (no similarity):
$$T(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B are the sets of substructures present in the two substances.
The K-Core algorithm was used to filter out substances with fewer than 30 neighbors. K-Core, an iterative algorithm, removes substances with the fewest neighbors first until all remaining substances have at least k neighbors. Previous use in protein-protein networks and protein function analysis provides evidence of K-Core's use in discovering useful network structures (Altaf-Ul-Amine et al., 2003; Alvarez-Hamelin et al., 2005; Wuchty and Almaas, 2005). The parameter 30 was chosen to reduce the network to a manageable number of well connected modules.
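The similarity and filtering steps can be sketched in a few lines. This is an illustrative pure-Python version, not the pipeline used for the analysis: fingerprints are assumed to be given as sets of the indices of substructures that are present, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Shared substructures divided by all substructures present in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_edges(fingerprints, threshold):
    """Return {(u, v): similarity} for every pair at or above the threshold."""
    names = sorted(fingerprints)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            score = tanimoto(fingerprints[u], fingerprints[v])
            if score >= threshold:
                edges[(u, v)] = score
    return edges


def k_core(edges, k):
    """Repeatedly drop nodes with fewer than k neighbors; return the survivors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for node in [n for n, nb in neighbors.items() if len(nb) < k]:
            # Removing a node also removes it from its neighbors' sets,
            # which may push them below k on the next pass.
            for other in neighbors.pop(node):
                neighbors[other].discard(node)
            changed = True
    return set(neighbors)
```

With real data one would call `similarity_edges(fingerprints, 0.65)` followed by `k_core(edges, 30)`; the all-pairs loop is quadratic and is only meant to make the definitions concrete.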
Module creation-Following K-Core filtration, we used the Blondel (Blondel et al., 2008) module recognition algorithm to identify chemical modules in the K-Core reduced similarity graph. Blondel's algorithm optimizes Q, a measure of network modularity as evaluated by a function of vertex similarity and module assignment:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
In the above formula:
• $A_{ij}$ is the similarity of chemical i and j,
• $m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the total sum of all similarities (each pair counted once),
• $k_i = \sum_j A_{ij}$ is the sum of similarities to chemical i,
• $c_i$ is the module containing chemical i,
• $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise.
Q takes on values between −1 and 1. Good modularity, defined by stronger similarity between substances in the same modules versus different modules, is observed for networks with Q ≥ 0.3 (Blondel et al., 2008).
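To make the modularity measure concrete, the sketch below evaluates Q for a given module assignment, using the form published by Blondel et al. (2008): the sum over same-module pairs of A_ij minus k_i k_j / 2m, divided by 2m, where 2m is the sum of A_ij over all ordered pairs. It only scores an assignment and does not perform the optimization itself; the function name and input layout are assumptions.

```python
def modularity(similarity, module_of):
    """Compute modularity Q for a weighted, undirected similarity graph.

    similarity: dict mapping an unordered pair (i, j) to its weight, each pair once.
    module_of: dict mapping every node to its module label.
    """
    nodes = sorted(module_of)
    # Build the symmetric weight lookup A[(i, j)] == A[(j, i)].
    weights = {}
    for (i, j), w in similarity.items():
        weights[(i, j)] = w
        weights[(j, i)] = w
    # k[i] is the summed similarity attached to node i.
    k = {n: sum(weights.get((n, other), 0.0) for other in nodes) for n in nodes}
    two_m = sum(k.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in nodes:
        for j in nodes:
            if module_of[i] == module_of[j]:
                q += weights.get((i, j), 0.0) - k[i] * k[j] / two_m
    return q / two_m
```

On a toy graph of two disconnected pairs, assigning each pair to its own module gives Q = 0.5, while placing all four nodes in a single module gives Q = 0.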
Gephi-Gephi, a network visualization tool, was used to construct and analyze similarity networks (Bastian et al., 2009). The code for Gephi is openly available 6 and free to extend or modify.
Force layout-The force layout algorithm (Jacomy et al., 2014) was used for generation of chemical similarity networks. The force layout algorithm works on graphs with nodes and edges. Nodes in a graph are connected by edges. The force layout algorithm treats nodes as charged particles that repel each other and edges as physical connections between these particles. The algorithm then positions nodes via a physics simulation.
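The idea can be illustrated with a single update step of a generic spring-and-repulsion layout. This is a simplified sketch and not the ForceAtlas2 algorithm of Jacomy et al. (2014) as implemented in Gephi; the force laws (inverse-square repulsion, linear attraction) and the parameters `repulsion`, `attraction` and `step` are assumptions chosen for brevity.

```python
import math


def force_layout_step(positions, edges, repulsion=1.0, attraction=0.1, step=0.1):
    """Move every node once along its net force and return the new positions.

    positions: dict mapping node -> (x, y); edges: iterable of (u, v) pairs.
    """
    forces = {node: [0.0, 0.0] for node in positions}
    nodes = list(positions)
    # Repulsion: every pair of nodes pushes apart, more weakly at larger distance.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = positions[v][0] - positions[u][0]
            dy = positions[v][1] - positions[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            magnitude = repulsion / dist ** 2
            forces[u][0] -= magnitude * dx / dist
            forces[u][1] -= magnitude * dy / dist
            forces[v][0] += magnitude * dx / dist
            forces[v][1] += magnitude * dy / dist
    # Attraction: each edge pulls its two endpoints together like a spring.
    for u, v in edges:
        dx = positions[v][0] - positions[u][0]
        dy = positions[v][1] - positions[u][1]
        forces[u][0] += attraction * dx
        forces[u][1] += attraction * dy
        forces[v][0] -= attraction * dx
        forces[v][1] -= attraction * dy
    return {
        node: (positions[node][0] + step * forces[node][0],
               positions[node][1] + step * forces[node][1])
        for node in positions
    }
```

Repeating the step until positions stabilize draws connected substances together and pushes unconnected ones apart, which is why tightly connected modules appear as visually compact neighborhoods.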
Term Frequency x Inverse Document Frequency (TFIDF)-TFIDF was performed on an 881-dimensional "substructure importance" vector by summing the occurrences of all 881 substructures inside a module (module frequency) and dividing by their frequency in all substances (inverse chemical frequency). We denote this MF_ICF or "Module Frequency Inverse Chemical Frequency".

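$$\mathrm{CF}(s_j) = |\{c : s_j \in c\}|$$
counts the occurrence of structure $s_j$ in all substances.
$$\mathrm{MF}(s_j, M_i) = |\{c \in M_i : s_j \in c\}|$$
counts the occurrence of structure $s_j$ in module $M_i$.
$$V_i = \left( \frac{\mathrm{MF}(s_1, M_i)}{\mathrm{CF}(s_1)}, \ldots, \frac{\mathrm{MF}(s_{881}, M_i)}{\mathrm{CF}(s_{881})} \right)$$
is the substructure importance vector for module i.
$$\mathrm{sim}(M_i, M_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$
$M_i$ and $M_j$ similarity is measured as the cosine of the angle between both substructure importance vectors, given here as the vector dot product over vector magnitudes.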
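A small sketch of the module frequency and cosine calculation follows. Fingerprints are assumed to be sets of the indices of substructures present, and the function names are hypothetical; for the real data `n_bits` would be 881.

```python
import math


def substructure_importance(module_fps, all_fps, n_bits):
    """Module frequency of each substructure divided by its frequency overall."""
    vector = []
    for bit in range(n_bits):
        module_count = sum(1 for fp in module_fps if bit in fp)
        overall_count = sum(1 for fp in all_fps if bit in fp)
        vector.append(module_count / overall_count if overall_count else 0.0)
    return vector


def cosine(u, v):
    """Dot product over the product of magnitudes; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy example with 3 substructures and two modules of two substances each.
all_fps = [{0, 1}, {0}, {2}, {1, 2}]
vec_a = substructure_importance(all_fps[:2], all_fps, 3)
vec_b = substructure_importance(all_fps[2:], all_fps, 3)
```

In the toy example substructure 1 is shared between the modules while substructures 0 and 2 are exclusive to one module each, so the two importance vectors overlap only weakly.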
Toxicity databases-We aggregated data from multiple toxicologically relevant databases for analysis of biological and chemical structure data and its relationship with studies found in ECHA data. available for download 9 . Access to the PubChem and ChEMBL libraries was available through web services 10 (Bolton et al., 2008). Overlaps between databases were found by matching CAS Registry Numbers (CAS RN). The ChEMBL database stores compounds by a unique chemical identifier (ChEMBL ID) and does not contain CAS RN. For this overlay, CAS RN were converted to canonical SMILES and subsequently searched against the ChEMBL library. Because the PubChem and ChEMBL libraries are large and accessed via web services, the overlap between these databases was taken as reported by The European Bioinformatics Institute 11 .
The results of assays found in PubChem and ChEMBL for high production volume compounds were aggregated using the PubChem Power User Gateway and ChEMBL API. The response of a compound in a given assay was recorded independent of the experimental outcome (e.g., active, inactive, inconclusive, etc.). The assays within CTD were available using the batch query portal within the site 12 . Each chemical-gene interaction for a queried compound was recorded as a response.

Extracted data overview
Efforts to determine chemical hazards such as eye irritation, skin sensitization and other health hazards have resulted in the accumulation of large amounts of privately held toxicity data. REACH legislation has resulted in the most extensive effort to systematically collect such data and outlined the necessary additional chemical testing that must be done. The constructed database, downloaded December 17, 2014, contains 816,048 such documents with 9,801 unique substances (identified by ECNumber) and 3,609 unique study descriptions.
Out of the 509,083 studies with a purpose flag in the extracted data, 13.5% (68,866) have the purpose flag "weight of evidence", 2.5% (13,051) "disregarded study", 44.7% (227,417) "key study", and 39.2% (199,749) have the purpose flag "supporting study" (Fig. 1). Purpose flags can be useful for defining the breadth of database queries; some analyses may only have interest in study results directly used for classification and labeling and should refine their searches to studies with purpose flag "key study".

PubChem chemical similarity
Mapping substances from REACH to PubChem enables the analysis of chemical similarity via PubChem 2D conformational substructure fingerprints (Jaworska and Nikolova-Jeliazkova, 2007; Cheng et al., 2014; Steinbeck et al., 2003). Substructure fingerprints can be used in combination with the Tanimoto distance (number of shared substructures divided by total number of substructures) to build the chemical similarity map in Figure 3. We employed the 2D conformational fingerprint, which treats each fragment as 1 or 0 depending on its presence in a substance. Similarity is calculated as the number of shared fragments divided by the total number of fragments in both molecules. Although other similarity measures exist for binary vectors, we chose Tanimoto for its simplicity (Lourenço et al., 2004). More advanced similarity measures can be expected to perform more strongly than the baseline-setting approach used here.
Large chemical similarity graphs both allow visualization of the global chemical diversity of a dataset and suggest different chemical classes within the data. In construction of the chemical similarity network, filtering was performed for visualization and identification of network modules. Edges between substances with similarity less than 0.65 were discarded.
Edge filtration and K-Core chemical filtration reduce 3,122 original substances (mapped from REACH to PubChem) to 1,383 and the number of edges from 84,993 to 69,041. Preservation of about 44% of the original substances and about 81% of the edges demonstrates the well-connectedness of the entire chemical similarity network. Figure 4 shows the resulting filtered chemical similarity map with substances colored by modularity.
The REACH extraction network modularity Q value of 0.688 demonstrates strong modularity. Supporting evidence of strong modularity comes from visual inspection of the resulting map (with 9 modules given unique colors). Three large disconnected modules can be seen divided into visually reasonable neighborhoods. Edge similarity is visualized via transparency, with opaque edges of higher similarity and translucent edges of low similarity; tightly connected modules are observed to display dark, strongly weighted edges.

Gephi force layout visualization-Layout and visualization rely on the force layout algorithm implemented within an open source Java network visualization software called Gephi (Bastian et al., 2009). While technical details are beyond the scope of this paper, ForceAtlas distributes edges and nodes by simulating a physical system where nodes repulse each other (like charged particles) and edges attract their attached nodes (like springs) (Jacomy et al., 2014).
Substances are colored by their module number in Figure 4, and several example substances from each module are shown in Figure 5. While the Blondel et al. modularity algorithm provides a strong determination of global modules, it is interesting to consider the intramodule cohesiveness. Module cohesiveness, as measured by comparing similarity between substances in a module to substances outside a module, is the basis for Blondel algorithm module identification (Blondel et al., 2008). For example, visual inspection shows that module 8 is not a very cohesive module and could be broken up into several sub modules, and the chemical examples chosen from module 8 are selected from disparate submodules and do not appear strongly related. Module 2 showed extremely high intra-connectivity and structurally very similar substances -this likely reflects a class for which using a SAR approach could be fruitful. To attempt to investigate and quantify this connectivity we borrowed the "Term Frequency x Inverse Document Frequency (TFIDF)" approach from document retrieval literature (Salton et al., 1975). TFIDF is often used in text-mining to assess the "importance" of a word by calculating its frequency in a given document in comparison to its typical appearance in the broader corpus, e.g., for a word to have a high value it must appear frequently in a document, but infrequently in other documents. We adapted this approach for chemical substructures to examine which substructures were the most informative for each module. Table 1 gives the highest ranking 10 substructures in each module. Table 2 gives the similarity between each module measured in this way. The results help to confirm the validity of the TFIDF approach. Modules that appear visually related (Fig. 4) also have high quantitative similarity. Example substances were chosen from each module to help visualize the module constituency. The examples are given in Figure 5 and help to inform module characterization.

Module analysis-The
Three super modules, modules (1, 4, 6, 8), modules (0, 7, 5, 3), and module 2, can easily be visualized. The two bigger super modules, modules (1, 4, 6, 8) and modules (0, 7, 5, 3), differ mainly in the frequency of straight-chain and cyclic alkanes or aromatic rings, respectively. In the first super module, modules (1, 4, 6, 8), modules 1 and 6 are both long and short chain esters differing only in the degree of saturation of their alkyl chains, explaining the high amount of similarity between the modules. Module 8 showed highly cyclic structures of varying ring size and showed intermodular similarity with module 6 due to the O-C-R substructures contained in the cyclic alcohols and the esters. Another super module, module 2, is based on glycine derivatives that share little similarity with all other modules. The slight overlap with module 4, a module with ester and ether derivatives, comes from the shared O=C-O-R moiety in both groups. The other large super module, modules (0, 7, 5, 3), also shows some obvious feature overlaps. Module 0 is characterized by a high frequency of alcohol derivatives and esters, and showed the highest intermodular similarity with module 7, a module showing a high frequency of thiols. The similarity is most likely owing to the frequency of aromatic cyclic structures with a lone substitution in both groups. Module 3 (quinone and glycine derivatives) and module 5 (dianilines) shared high intermodular similarity due to the shared aniline backbone.

OECD guideline usage
ECHA studies designate OECD guideline numbers when appropriate. These numbers improve analysis because studies sharing the same OECD guideline can be expected to have similar data formats (materials and methods, results, etc.). Table 3 shows the top 3 OECD guidelines for each enriched category (InVivo, InVitro, QSAR / PCHEM, Read Across). It should be noted that since OECD guidelines are given by ECHA in natural language and were extracted via regular expression recognition, it is possible that some guidelines were extracted imperfectly.
REACH requirements for in vitro skin corrosion, skin irritation, eye irritation, and bacterial gene mutation are described in Annex VII, i.e., for all tonnage bands (Aulmann and Pechacek, 2014; European Commission, 2006). As these endpoints are required for large numbers of substances, they should have a high frequency in the extracted data. Given this constraint, it is surprising that none of the OECD skin sensitization guidelines appear near the top in Table 3. Automatic curation indicates that out of the 9,801 extracted substances 5,551 were missing explicit in vivo key experimental skin sensitization studies, possibly due to data waiving or being substituted by read-across methods. Manual inspection of six online ECHA dossiers of substances missing key experimental in vivo sensitization testing agreed with the automatically extracted results and identified the following:
• 919-583-6: No key skin sensitization study given
Analysis of the substances missing a key skin sensitization study indicated that out of 637 skin sensitization studies with data waiving, 360 are labeled as "other justification", 255 are classified as "study scientifically unjustified", and 148 as "study technically not feasible". Examination of study result types associated with substances without a skin sensitization study indicates 2,735 read-across from supporting substance, 2,156 read-across based on grouping of substances, 2,144 experimental result, 157 "estimated by calculation", and 128 (Q)SAR. These data indicate that read-across from a supporting substance is a more prevalent study type than read-across from categorization for substances lacking a key experimental skin sensitization study.
TG 401: Acute Oral Toxicity (OECD, 1987) is the third most prevalent in vivo OECD TG in the extracted database. It is also the second most prevalent guideline in the read-across category. REACH stipulates in Annex VII that acute toxicity must be evaluated for all tonnage bands, thus corroborating the extraction's high prevalence (Aulmann and Pechacek, 2014). Overlaps in in vitro and read-across OECD guidelines indicate potentially rich datasets for the evaluation of read-across approaches.
OECD guideline data is used extensively in other publications in this issue evaluating ocular, skin and oral toxicity in more depth (Luechtefeld et al., 2016a-c, this issue).

Hazard distribution
ECHA dossier submissions contain classification and labeling data that can be mapped to hazard definitions given by the Globally Harmonized System of Classification and Labelling. Figure 6 identifies label frequency as reported in extracted ECHA dossiers.
Extracted GHS values exist for 6,186 REACH substances; incomplete GHS extractions are due to the limitations in text analysis and occasional inconsistencies in data format.
The most abundant hazard is H317 "May cause an allergic skin reaction" with 1,255 (20%) labelled substances, 4,317 (70%) substances with "conclusive but not sufficient data for classification" (which designates that data are available indicating no need for classification), 428 (6%) substances recorded as "data lacking", 26 (0.4%) substances recorded as "inconclusive" and 160 (2.5%) substances for which data extraction failed. The high frequency of this hazard, the relatively well-established Adverse Outcome Pathway (AOP), as well as the relative ease of using in vitro tests for various steps of the pathway make it an ideal test case for further research into Integrated Testing Strategies (Hartung et al., 2013). For a more detailed analysis of the skin sensitization data see Luechtefeld et al. (2016c, this issue).
The information on hazard frequencies in Table 4 can be used as estimates for hazard prevalence to anchor testing strategies.

Animal use
The number of animals used in REACH data sources can be extracted simply from Materials and Methods data. In a given study the number of animals used is given in natural text, e.g., "5 males and females". We wrote heuristics for extracting animal counts from these natural language descriptions. Additionally, due to lack of reference identifiers, the same reference may be counted multiple times when it is used for different ECHA studies, thus inflating the estimates.
We can evaluate use of animals in reference studies over time by first assessing the distribution of study start dates (Fig. 7) and then finding the distribution of number of animals used in each year (Fig. 8). When comparing Figures 7 and 8 it appears that the number of animals used per reference was lower in the late 2000s relative to the 1990s.

Data overlap
To determine the relevance of ECHA extracted data in the context of current toxicological databases, the 9,801 extracted REACH compounds were searched against three well-known toxicity datasets: Toxicity Reference Database (ToxRefDB), Toxicity Testing in the 21 st Century (Tox21) and Comparative Toxicogenomics Database (CTD).
ToxRefDB is a collection of 30 years of animal toxicity testing data in the US Environmental Protection Agency (US EPA) and contains 474 compounds (Martin et al., 2009).  -Ramos et al., 2013). This target chemical library mainly consists of compounds of environmental interest (e.g., high production volume compounds, pesticides, drugs, etc.).
The CTD consists of 13,446 compounds with toxicogenomics data (e.g., drug molecules). This public database aims to explore how environmental exposures impact human health via manually curated chemical-gene, chemical-protein, chemical-disease and gene-disease interactions.
REACH compounds have the largest overlap (1,737 compounds) with Tox21 compounds, possibly reflecting the similar goals of Tox21 and REACH (Tab. 5). The overlap between REACH and CTD is much lower. The extracted REACH substances cover 11% of ToxCast, 20% of Tox21 and 7% of CTD. The biological data available in these datasets combined with in vivo endpoints extractable from REACH represent a strong modeling potential.
PubChem, a large chemical database hosted by the National Center for Biotechnology Information (NCBI) and the National Institutes of Health (NIH) (Cheng et al., 2014), currently contains 68 million compounds tested in over 1 million bioassays, including massive amounts of toxicity data. It is not surprising that 4,955 of the REACH substances are found here. ChEMBL, established by the European Bioinformatics Institute, is part of the European Molecular Biology Laboratory (EMBL). ChEMBL is a chemical-bioassay database manually curated from peer-reviewed publications consisting mostly of drug-like compounds (Gaulton et al., 2011), but 2,080 of the REACH chemicals are also represented here. Both repositories are thus very rich for further analysis.

Discussion
Massive amounts of toxicity data have been generated in the past decade and various data repositories have been developed to share data with research communities. REACH is the largest of these efforts with expected multi-billion Euros of testing costs (Hartung and Rovida, 2009; Rovida and Hartung, 2009), but so far its full potential has not been realized. A searchable repository of the publicly available REACH data represents an enormous resource for toxicology, particularly computational approaches requiring large datasets.
REACH data can be used to inform risk assessments, develop computational models, develop and evaluate test strategies, and improve / store toxicological knowledge on a per study basis. The extracted data is far from perfect as the non-standardized presentation of data in many narrative fields is prone to errors when extracted automatically with search engines. While the primary objective of REACH submissions is not data extraction and mining, this publication and others in this issue (Luechtefeld et al., 2016a-c) demonstrate the potential value of ECHA reports submitted for REACH. Further curation, as well as data from registrations occurring after December 2014, would be extremely helpful.
Ultimately, reduction of animal testing will depend in large part on the development of in silico models such as QSAR (Zvinavashe et al., 2009; Patlewicz et al., 2013, 2015).

Improvement of computational models relies on accessibility of training and testing data.
The open data nature of Tox21, ToxRefDB, PubChem, CTD and ChEMBL promotes numerous publications and development of ever improving statistical and expert models.
Overlaps of REACH with existing databases given in Table 5 further demonstrate the value of the extracted data: ToxRefDB (a commonly used animal testing database) covers only 474 substances with multiple animal endpoints while the extraction in this publication covers over 9,800.

Conclusion
The extracted ECHA dataset first of all allows us to better understand the landscape of substances for a given hazard: Which parts of the chemical universe are associated with a given hazard? How concordant and reproducible are different methods? With the limited information of the New Chemicals Database (NCD) of the EU (which is not publicly available), it has previously been shown how much useful information can be extracted from such databases using the example of skin irritation. Our parallel articles in this ALTEX issue address the most prevalent human hazards, i.e., oral toxicity, skin sensitization and eye irritation.
One goal of this publication is to underscore the importance of structuring data in a machine-readable format. While REACH in many ways has a workable ontology for classifying endpoints, the toxicological value of REACH data could be realized by using formal data structures for results extracted from the main guideline-compliant studies, especially for the key hazards of eye irritation and skin sensitization, which easily lend themselves to this approach. Eventually the development of ontologies (e.g., OpenTox, ToxML) to classify studies by type and study results and outcomes for more complicated endpoints, such as developmental toxicity, will greatly aid the ability of toxicologists to assemble large datasets.
Furthermore, it is our hope that our arguments and referenced articles will motivate the systematic and more comprehensive publication of REACH data to the general public. An open REACH platform would allow third parties to investigate concepts such as OECD TG use and quality assessment, testing redundancies, and hazard distributions, and could automate many research tasks.

Possible double counting due to missing reference identifiers in ECHA dossiers.

Tab. 2: Intermodular similarity as determined by cosine of angle between module substructure importance vectors
Substructure importance vectors are determined via analog to TFIDF, where a module's importance for a given substructure is given by its frequency within the module multiplied by the inverse of its frequency in all substances. Green cells show the greatest similarity for the module in each row. These similarities fit well with visual inspection of Figure 4.