Generalized Read-Across (GenRA): A Workflow Implemented into the EPA CompTox Chemicals Dashboard

Generalized Read-Across (GenRA) is a data-driven approach that makes read-across predictions on the basis of a similarity-weighted activity of source analogues (nearest neighbors). GenRA has been described in more detail in the literature (Shah et al., 2016; Helman et al., 2018). Here we present its implementation within the EPA’s CompTox Chemicals Dashboard to provide public access to a GenRA module structured as a read-across workflow. GenRA assists researchers in identifying source analogues, evaluating their validity, and making predictions of in vivo toxicity effects for a target substance. Predictions are presented as binary outcomes reflecting the presence or absence of toxicity, together with quantitative measures of uncertainty. The approach allows users to identify analogues in different ways, quickly assess the availability of relevant in vivo data for those analogues, and visualize these in a data matrix to evaluate the consistency and concordance of the available experimental data for those analogues before making a GenRA prediction. Predictions can be exported into a tab-separated value (TSV) or Excel file for additional review and analysis (e.g., doses of analogues associated with production of toxic effects). GenRA offers a new capability of making reproducible read-across predictions in an easy-to-use interface.

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is appropriately cited.

1 https://comptox.epa.gov/dashboard

Disclaimer: The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.


Introduction
Given the thousands of data-poor or toxicologically uncharacterized chemicals in commerce, read-across has proved to be a convenient and efficient data gap filling technique that can be used within analogue and category approaches for many different regulatory purposes. Read-across represents the application of data from a source chemical(s) for a particular property or effect to predict the same property or effect for the target chemical (the chemical of interest) (OECD, 2014). Read-across is traditionally anchored with conventional in vivo and in vitro data, though concerted efforts are starting to be made to exploit high throughput (HT) and high content (HC) screening data as a means of substantiating biological similarity (Zhu et al., 2016; Shah et al., 2016). Some of these efforts are anchoring such data to key events within adverse outcome pathways (AOPs) (Schultz and Cronin, 2017).
Here we present the web-based implementation of Generalized Read-across (GenRA), a data-driven approach that makes repro-

Results and discussion
There are several steps in the development of a category or analogue approach (Patlewicz et al., 2017, 2018). The seven key steps in the workflow are as follows:
1. Decision context
2. Data gap analysis
3. Overarching similarity rationale
4. Analogue identification
5. Analogue evaluation
6. Data gap filling
7. Uncertainty assessment
In the GenRA implementation, the steps have been addressed as shown in Figure 1 (Helman et al., 2018).
The starting point for GenRA relies on identifying the chemical of interest (target chemical) by performing a "basic" search within the EPA CompTox Dashboard. The outcome of a search gives rise to a "chemical details" landing page with a number of selectable tabs and sub-tabs to the left of the screen (Fig. S1 13). One of the tabs navigates to the GenRA module.
Once GenRA is selected, a grid-like display is presented with an indicator at the top of the page that reflects the relevant step in the read-across workflow. Users can navigate between steps by clicking on the indicator bar (Fig. S2 13).
The starting grid display only has the first window unobscured. This grid window shows the neighborhood of source analogues that surround the target substance, which appears in the center of a radial plot (Fig. S3 13). Starting from 12 o'clock on the plot, analogues are arranged in decreasing order of similarity as calculated by the Jaccard index 14 (which ranges from 0 to 1, where 0 denotes dissimilar and 1 denotes identical). This radial plot represents the analogue identification and evaluation steps of the workflow. By default, 10 analogues are shown, based on Morgan chemical fingerprints. The view can be updated by choosing a different fingerprint type and by changing the number of analogues. A minimum of 5 analogues and maximum of 10 analogues can be selected. Analogues are automatically filtered by the availability of in vivo toxicity data as taken from ToxRefDB v1.0. This is to ensure that the analogues identified are helpful in a read-across prediction. Hovering over any of the source analogue depictions in the radial plot reveals the numerical pairwise similarity between the target and that analogue. If the user wishes to

state transfer (REST) web services; and 3) a data tier for storing large-scale chemical, bioactivity, and toxicity data for thousands of chemicals. The presentation tier of GenRA is implemented using Vue 2 and each step in the workflow is designed as a self-contained component. Each component intuitively captures the key tasks that must be performed in the workflow via a combination of inputs (i.e., buttons, input items, etc.) and an interactive graphical output. All graphical outputs of the individual components are implemented as scalable vector graphic (SVG) elements with context-sensitive help information and/or interaction capabilities. The presentation layer components in GenRA perform their specific tasks by obtaining information about chemicals, analogues, bioactivity and toxicity from the application tier. The application tier is implemented in Python 3 using the Flask 4 microservices framework, which is deployed using Apache/wsgi 5. The data tier is implemented using MongoDB 6, which is a document-oriented NoSQL database. Information about chemicals, bioactivity and toxicity is stored as separate MongoDB collections to facilitate the efficient implementation of GenRA algorithms. Chemical structure data were obtained from the Distributed Structure Searchable Toxicity (DSSTox) database (originally extracted April 2017 but updated continuously) (Richard et al., 2016; 7), whereas chemical descriptors, comprising Morgan fingerprints (Rogers and Hahn, 2010) and topological torsion descriptors (Nilakantan et al., 1987), were generated using RDKit 8. ToxPrint chemotypes were generated using the AM-MN Chemotyper for command line operation (Yang et al., 2015; 9).
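The similarity ranking that drives the radial plot can be illustrated with a minimal sketch. It assumes each fingerprint is available as a set of "on" bit positions; the function names, the example chemicals and the handling of empty fingerprints are hypothetical and are not taken from the GenRA code base.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index of two fingerprints given as sets of 'on' bit positions."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


def nearest_analogues(
    target: Set[int], candidates: Dict[str, Set[int]], k: int = 10
) -> List[Tuple[str, float]]:
    """Return the k candidates most similar to the target, most similar first."""
    scored = [(name, jaccard(target, fp)) for name, fp in candidates.items()]
    # Sort by decreasing similarity; ties are broken by name for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Toy fingerprints (hypothetical bit positions, not real Morgan fingerprints).
target_fp = {1, 4, 7, 9}
candidate_fps = {
    "analogue_A": {1, 4, 7, 9},
    "analogue_B": {1, 4, 8},
    "analogue_C": {2, 3},
}
ranked = nearest_analogues(target_fp, candidate_fps, k=2)
for name, similarity in ranked:
    print(f"{name}: {similarity:.2f}")
```

In practice the bit sets would come from a cheminformatics toolkit such as RDKit, and candidates would first be restricted to chemicals with in vivo data, as described for the analogue filter.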
The bioactivity high throughput screening (HTS) data were obtained from the ToxCast 10 and Tox21 11 programs. The in vivo toxicity data were obtained from ToxRefDB v1.0 12.
Bioactivity descriptors (denoted biology or bio) comprised hit calls (active (1) or inactive (0)) from 820 ToxCast HTS assays. The 820 bioactivity descriptors were converted into fingerprints that are used singly (chm or bio to denote either chemical or bioactivity descriptors) to predict up to 129 toxicity outcomes from 10 different study types from ToxRefDB v1.0. The study types are chronic (CHR), subchronic (SUB), subacute (SAC), acute (ACU), neurotoxicity (NEU), developmental neurotoxicity (DNT), developmental toxicity (DEV), reproductive toxicity (REP), and multigenerational toxicity (MGR). A final category of other (OTH) is for any study not fitting any of the previously mentioned study types.
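As an illustration of how hit calls might be turned into a bioactivity fingerprint, the sketch below orders calls by a fixed assay list and records active assays as 1. The assay names are placeholders, and encoding untested assays as 0 is an assumption made here for simplicity rather than a documented GenRA rule.

```python
from typing import Dict, List, Set


def hitcall_fingerprint(assays: List[str], hit_calls: Dict[str, int]) -> List[int]:
    """Build a binary fingerprint: 1 where the assay was active, otherwise 0.

    Assumption: assays with no result for the chemical are encoded as 0.
    """
    return [1 if hit_calls.get(assay) == 1 else 0 for assay in assays]


def on_bits(fingerprint: List[int]) -> Set[int]:
    """Positions of the active bits, convenient for set-based similarity."""
    return {index for index, bit in enumerate(fingerprint) if bit == 1}


# Hypothetical assay identifiers; the article describes 820 ToxCast assays.
ASSAYS = ["assay_1", "assay_2", "assay_3", "assay_4"]
calls = {"assay_1": 1, "assay_2": 0, "assay_4": 1}  # assay_3 was not tested

bio_fp = hitcall_fingerprint(ASSAYS, calls)
active_positions = on_bits(bio_fp)
print(bio_fp, active_positions)
```

Keeping the assay order fixed across chemicals is what makes two fingerprints comparable position by position.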
are available or alternatively that the analogue set is missing the data for the toxicity effects of most interest.
Whilst the summary views are helpful to gain a brief perspective of how much data are available for the target and source analogues, they do not provide any information on their potency (e.g., lowest effect level (LEL) in mg/kg-day, etc.) or hazard profile. To evaluate this type of information, the "Generate Data Matrix" button is clicked to move to the next step of the workflow, "Run GenRA Prediction". At this point, the final grid becomes unobscured to reveal a matrix view of the target and source analogues. The initial part of the assessment here addresses the "Analogue Evaluation" step, since the user can evaluate the consistency and concordance of the analogues, relative to their experimental data, in terms of the presence or absence of toxicity effects. Presence and absence are reflected by the colors of the boxes in the data matrix: red for the presence of toxicity effects, blue for the absence of toxicity effects, and grey for no data. Hovering over a grey or blue box reveals a tooltip indicating no data or no effects, respectively, whereas for red boxes the tooltip shows the doses at which toxicity effects were reported. The data matrix view, using the same color codes (Fig. S6 13), provides the user with an informative perspective of the consistency and concordance of the available data across the analogues and between the endpoints. Users can filter the effects of interest using the filter window, select the threshold for the number of positives and negatives within the analogue set, and alter the view so that the similarity index is used to shape the size of the data matrix boxes (Fig. S7 13). The data matrix is ordered with the target substance in the first column, followed by the source analogues in order of decreasing similarity.
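The prediction step that follows the data matrix can be sketched as a similarity-weighted average of analogue outcomes. This is a simplified illustration only: it assumes a plain weighted mean over analogues that have data, and the 0.5 decision threshold and all names are hypothetical. The exact formulation and the uncertainty measures are those described in the cited GenRA papers and are not reproduced here.

```python
from typing import List, Optional, Tuple

# Each neighbour is (similarity to target, outcome), where outcome is
# 1 (effect present, red), 0 (effect absent, blue) or None (no data, grey).
Neighbour = Tuple[float, Optional[int]]


def similarity_weighted_activity(neighbours: List[Neighbour]) -> Optional[float]:
    """Similarity-weighted mean outcome over analogues that have data."""
    informative = [(s, y) for s, y in neighbours if y is not None]
    total_weight = sum(s for s, _ in informative)
    if total_weight == 0:
        return None  # nothing to read across from
    return sum(s * y for s, y in informative) / total_weight


def binary_call(score: Optional[float], threshold: float = 0.5) -> Optional[int]:
    """Convert a score to a presence/absence call (threshold is an assumption)."""
    if score is None:
        return None
    return 1 if score >= threshold else 0


neighbours = [(0.8, 1), (0.6, 0), (0.5, None), (0.4, 1)]
score = similarity_weighted_activity(neighbours)
call = binary_call(score)
print(score, call)
```

Deselecting an analogue in the interface corresponds, in this sketch, to dropping its entry from the neighbour list before the average is taken.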
The full extent of toxicity effects can be browsed by using the scroll bar to the right of the screen. Users can also elect to deselect a source analogue from further consideration by clicking on the

conduct a GenRA analysis for a different source analogue, or wishes to view the Chemical Results page, clicking on the structure depiction in the radial plot will open a new browser tab with the respective chemical details page of that analogue in the Dashboard. Once a user is satisfied with the analogues identified, the Next button needs to be clicked to proceed to the next step of the workflow, Data gap analysis (denoted as Step Two: Data Gap Analysis and Generate Data Matrix in the interface).
At this point, the next two grid views become unobscured and the workflow indicator changes to "Data Gap Analysis & Generate Data Matrix". The first of these grid views is denoted as "Summary Data Gap Analysis" (Fig. S4 13). This view is intended to provide a landscape of the quantity of data records for the target substance and its source analogues with respect to the different data streams listed earlier: ToxCast, Tox21, Chemotypes and ToxRefDB. The number of records is marked in the colored boxes and reflected in the color itself: a black box indicates the greatest number of records, whereas a yellow box indicates fewer records. Colors are assigned automatically according to the underlying number of records. The summary is intended to provide a rapid perspective of how feasible a read-across might be based on the quantity of data for the source analogues.
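The record counts behind such a summary view can be sketched as a simple tally per chemical and data stream. The input layout, the chemical names and the helper names below are hypothetical; the sketch only illustrates the counting and gap-finding logic, not how the Dashboard stores or colors its data.

```python
from typing import Dict, List

STREAMS = ["ToxCast", "Tox21", "Chemotypes", "ToxRefDB"]


def summarize_records(
    records: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Dict[str, int]]:
    """Count records per chemical and data stream (missing streams count as 0)."""
    return {
        chemical: {stream: len(by_stream.get(stream, [])) for stream in STREAMS}
        for chemical, by_stream in records.items()
    }


def data_gaps(summary: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """List, for each chemical, the data streams with no records at all."""
    return {
        chemical: [stream for stream in STREAMS if counts[stream] == 0]
        for chemical, counts in summary.items()
    }


# Toy input: record identifiers grouped by chemical and stream (all invented).
records = {
    "target": {"ToxCast": ["a1", "a2"], "Chemotypes": ["c1"]},
    "analogue_A": {"ToxCast": ["a1"], "Tox21": ["t1"], "ToxRefDB": ["r1", "r2"]},
}
summary = summarize_records(records)
gaps = data_gaps(summary)
print(summary)
print(gaps)
```

A color scale such as the one described could then be derived from these counts, with the largest count mapped to the darkest shade.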
The second grid view (Fig. S5 13) reflects ToxRef as a group by Tox Fingerprint. In this case, the data view shows the in vivo toxicity effect records as represented by this toxicity fingerprint. A black colored box in the grid view denotes the presence of a record for a particular toxicological effect. The utility of the grid view is to help a user gain a perspective of what data gaps exist for the source analogues relative to the target substance, and which effects might be reasonably predicted by those analogues. The entire matrix can be browsed using the scroll bar. A user might choose to focus on a subset of effects with the knowledge that the identified analogues will be helpful in that regard as data