By PAUL VON BÜNAU
Data science projects are successful when they produce actionable results over several years. Since databases constitute the foundation of those endeavours, their selection is highly strategic, and the biomedical field is no exception to that rule. Understanding the data landscape enables well-informed decisions and fosters project success.
Long-term data needs
Sustainability is crucial when choosing datasets, as changing sources in the midst of a project can prove very costly. While many biological databases are produced as outcomes of research activities, most of them are archived or simply die within 10 to 15 years. Their longevity correlates with long-term financial support and institutional backing, which is why resources provided by major institutions should be preferred. Those include, among others, the National Library of Medicine (NLM) in the US, which is part of the National Institutes of Health, the European Bioinformatics Institute in Europe, and the WHO.
The output objectives of a data product define the information needs. Large compound datasets and the biomedical literature are, for instance, essential sources in drug discovery. Trial registries and resources from the regulatory agencies, on the other hand, help improve the efficiency of clinical studies. In any case, data science projects in biomedicine are complex and necessitate input from field experts. Manual curation, annotation, and validation of the data add tremendous value to the analysis.
Strategic integration of data sources
Yet collecting expert input is a slow and costly process that does not scale to large volumes of data. Terminologies and ontologies reduce the amount of human intervention required by enabling efficient data integration. Those classifications provide the reference model for connecting sources to each other. They also support the performance of data products in the long run, as they guarantee system interoperability through time, for instance when new databases are added or supplementary features are needed.
The content of biomedical databases central to pharma R&D is heterogeneous. To provide an overview of the open data landscape, we assessed twenty-one established resources in our white paper, Open Data in Biomedicine. We grouped those databases into five categories: genes & proteins, compounds & chemical information, biomedical literature, clinical trials, and terminologies & ontologies. In the present article, we describe “genes & proteins” as well as “compounds & chemical information” (see table below). The latter three categories are the subject of next week’s post.
Genes & proteins
Databases used in the field of genomics and proteomics form the backbone of modern drug discovery. They are the product of sequencing technologies and make up a heterogeneous group. For instance, they cover nucleotide and protein sequences, descriptions of human phenotypes, and 3D structures of large biological molecules (see figure below).
Those datasets contain results submitted by research groups around the world and subsequently curated either by teams of experts or by automated procedures. With abundant data generation all but guaranteed, the main challenge now lies in interpreting that constant flow. Understanding the relationships between genes and diseases, in combination with structural data, makes the discovery of new therapeutic targets possible.
Compounds & chemical information
Chemical compounds and medicines are made available in two types of databases that complement one another. The first class covers drug-like molecules as well as medicines that are either still at the experimental stage or already approved. Those datasets involve many features of molecules, ranging from their nomenclature and structure to their targets and pharmacological properties. ChEMBL and DrugBank are famous examples.
Open databases of chemical compounds have become mature resources in biochemistry and drug discovery since the Therapeutic Target Database was released in 2002. The quality and quantity of their data have continuously improved from that point onwards, leading to their extensive use in the early stages of drug discovery. Drawing upon large compound databases, virtual screening makes it possible to identify the candidate molecules most likely to exhibit biological activity at the investigated therapeutic targets.
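Conceptually, similarity-based virtual screening ranks library compounds by how closely their structural fingerprints match a query molecule. The sketch below is a minimal illustration using toy bit-set fingerprints and the Tanimoto coefficient; the compound names, feature bits, and threshold are invented for the example, and real pipelines would rely on cheminformatics toolkits such as RDKit.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def screen(query_fp: set, library: dict, threshold: float = 0.7) -> list:
    """Rank library compounds by similarity to the query, keeping hits above the threshold."""
    hits = [(cid, tanimoto(query_fp, fp)) for cid, fp in library.items()]
    return sorted([h for h in hits if h[1] >= threshold], key=lambda h: -h[1])

# Toy fingerprints: each integer stands for a structural feature bit.
library = {
    "CMPD-1": {1, 2, 3, 4, 5},
    "CMPD-2": {1, 2, 9},
    "CMPD-3": {7, 8, 9},
}
query = {1, 2, 3, 4, 6}
print(screen(query, library, threshold=0.3))
```

The same ranking logic scales to millions of library compounds, which is what makes screening against large open compound databases computationally attractive.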
The second class of compound datasets lists medicines that underwent the registration process at the FDA or the EMA. It is based on regulatory documents related to the applications for marketing authorisation submitted by pharmaceutical companies to the regulatory agencies. This complements data on chemical compounds nicely, as it reflects the finishing line of drug development.
Revolutionizing pharmaceutical R&D
Data science is revolutionizing pharmaceutical R&D, and knowing how to navigate the open data landscape is key to that revolution. Databases that are fundamental to supporting pharma R&D stem from various disciplines, such as genomics, structural biology, and clinical development. Hence, understanding the heterogeneity of biomedical data is crucial for meaningful data integration. Last week, we described the first two categories of resources covered in our white paper, Open Data in Biomedicine. This post presents the remaining three major types of sources: the scientific literature, clinical trial registries, and biomedical ontologies (see table below).
The scientific literature is the mirror of biomedical research worldwide and contains highly valuable information. That information, however, comes as unstructured text. PubMed is the most renowned resource referencing abstracts of biomedical articles. First released in 1996, this website gives access to MEDLINE, NLM’s prime bibliographic database. In addition, free access to full-text articles has been provided via PubMed Central since 2000. For the most part, its entries match abstracts listed in PubMed. The European open access platform for the biomedical literature was launched in 2007 and later renamed Europe PMC. It incorporates resources from PubMed and PubMed Central, thereby gathering abstracts and full-text papers on one platform. As a member of the PMC International network of digital archives, Europe PMC also aims at preserving access to the biomedical literature.
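PubMed can be queried programmatically through NCBI’s E-utilities interface. As a minimal sketch, the snippet below only constructs a search URL (no request is sent); the search term is illustrative, and production code would add an API key, rate limiting, and response parsing.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def pubmed_search_url(term: str, retmax: int = 20) -> str:
    """Build an E-utilities esearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

# Illustrative query combining a gene symbol and a MeSH term.
url = pubmed_search_url("BRCA1 AND breast cancer[MeSH Terms]")
print(url)
```

Fetching that URL returns a JSON list of matching PubMed IDs, which can then be passed to the efetch endpoint to retrieve the abstracts themselves.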
The Cochrane Database of Systematic Reviews (CDSR) is another useful source of scientific publications. It includes over 7,000 review articles written by specialists who look back at the recent advances in their field. Those thorough summaries are the product of a collaborative effort and reflect the current consensus. Systematic reviews thus contain information of high quality, but result from a slow process and are limited in scope. In contrast, the literature in its entirety also holds data on potential breakthroughs, the hidden gold nuggets. Text mining enables the automated extraction of that valuable information scattered across millions of papers. By turning unstructured text into structured data ready for analysis, text mining helps discover knowledge from the research pursued at biomedicine’s forefront.
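In its simplest form, text mining of the literature amounts to spotting known entities in free text and recording which ones co-occur. The sketch below shows naive dictionary-based co-mention extraction; the gene and disease lists are hypothetical stand-ins for the curated vocabularies a real pipeline would load, and serious systems use trained named-entity recognisers rather than exact matching.

```python
import re
from itertools import product

# Hypothetical dictionaries; real pipelines draw on curated ontologies.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "ovarian cancer"}

def comentions(abstract: str) -> set:
    """Return (gene, disease) pairs co-mentioned in an abstract."""
    text = abstract.lower()
    genes = {g for g in GENES if re.search(rf"\b{re.escape(g.lower())}\b", text)}
    diseases = {d for d in DISEASES if d in text}
    return set(product(genes, diseases))

abstract = ("Germline BRCA1 mutations confer a high lifetime risk "
            "of breast cancer and ovarian cancer.")
print(comentions(abstract))
```

Aggregated over millions of abstracts, such co-mention counts become the structured raw material for gene–disease association analyses.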
Clinical trial registries are structured data sources on completed and running clinical studies (see figure below). ClinicalTrials.gov focuses on the United States and is the world’s largest national database. Made public in 2000, this resource was born from the FDA Modernization Act of 1997 as an information base for trials with at least one site in the US or subject to FDA regulation, as well as for compounds manufactured in the US. Later amendments expanded the range of information required from study sponsors, including the communication of trial results. The EU Clinical Trials Register is the corresponding platform for promoting transparency in clinical development within the European Union. This website was launched in 2011 and provides access to the EudraCT database, a registry that has systematically recorded studies since 2004 and that also comprises a results section.
The International Clinical Trials Registry Platform (ICTRP) is a central database that gathers the information of several national registries approved by the WHO, including ClinicalTrials.gov and the EU Clinical Trials Register. Launched in 2007, this initiative gives the broadest overview of all studies worldwide. Furthermore, since trials are often run in multiple countries, the ICTRP strives to unambiguously identify clinical studies registered in various locations. Overall, clinical trial registries are precious assets for detecting in-licensing opportunities, assessing the competitive landscape, or optimising study design.
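The deduplication problem the ICTRP tackles can be illustrated with a small sketch: group trial records that share any registry identifier, so that the same study registered both in the US and in the EU counts once. The field names and identifiers below are invented for the example; real records carry primary and secondary IDs in registry-specific formats.

```python
def deduplicate(trials: list) -> list:
    """Group trial records that share any registry identifier (primary or secondary)."""
    groups = []
    for trial in trials:
        ids = {trial["id"], *trial.get("secondary_ids", [])}
        for group in groups:
            if group["ids"] & ids:            # overlapping identifiers: same study
                group["ids"] |= ids
                group["records"].append(trial["id"])
                break
        else:
            groups.append({"ids": set(ids), "records": [trial["id"]]})
    return groups

records = [
    {"id": "NCT00000001", "secondary_ids": ["2011-000001-01"]},
    {"id": "2011-000001-01"},   # same study, registered in the EU
    {"id": "NCT00000002"},
]
print(len(deduplicate(records)))  # two distinct studies
```

Without this kind of identifier reconciliation, naive counts across registries would systematically overstate the number of distinct studies.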
Terminologies & ontologies
Data integration is key to leveraging the information present in multiple sources. Ontologies and terminologies are used to connect databases and to achieve consistency despite the particularities of each individual dataset. Not only do they catalogue the terms and concepts employed in a subject field, they also determine the hierarchy among entities and the relationships between them. Some classifications cover medical terms in general, while others focus more specifically on molecular entities or human diseases. Ontologies are updated regularly to reflect discoveries and recent changes in biomedicine. They mirror the conceptualisation of an area of knowledge and, therefore, facilitate communication amongst scientists. Most importantly, ontologies guarantee the interoperability of information systems and increase their efficiency.
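In practice, the integration role of a terminology often boils down to mapping the many surface forms used across databases onto one canonical concept. The sketch below shows this with a toy synonym table; the entries are illustrative stand-ins for resources such as MeSH or SNOMED CT, which additionally supply concept identifiers and hierarchical relations.

```python
# Hypothetical synonym table; real projects load MeSH, SNOMED CT, or similar.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "high blood pressure": "hypertension",
}

def normalise(term: str) -> str:
    """Map a free-text term to its canonical label (identity if unknown)."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

# Records from two databases referring to the same condition now agree:
print(normalise("Heart attack"))
print(normalise("MI"))
```

Once every source speaks the same canonical vocabulary, joining records across databases reduces to matching on those normalised labels.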