Navigating the Biomedical Data Landscape – Part I
Data science projects are successful when they produce actionable results over several years. Since databases constitute the foundation of those endeavours, their selection is highly strategic, and the biomedical field is no exception to that rule. Understanding the data landscape enables well-informed decisions and fosters project success.
Long-term data needs
Sustainability is crucial in choosing datasets, as changing sources in the midst of a project can prove very costly. While many biological databases are produced as outcomes of research activities, most of them are archived or simply die within 10 to 15 years. Their longevity correlates with long-term financial support and institutional backing, which is why resources provided by big institutions should be preferred. Those include, among others, the National Library of Medicine in the US, which is a part of the National Institutes of Health, the European Bioinformatics Institute in Europe, and the WHO.
The output objectives of a data product define the information needs. Large compound datasets and the biomedical literature are, for instance, essential sources in drug discovery. Trial registries and resources from the regulatory agencies, on the other hand, help improve the efficiency of clinical studies. In any case, data science projects in biomedicine are complex and necessitate input from field experts. Manual curation, annotation, and validation of the data add tremendous value to the analysis.
Strategic integration of data sources
Yet collecting expert input is a slow and costly process that does not scale to large volumes of data. Terminologies and ontologies reduce the amount of human intervention required by enabling efficient data integration. Those classifications provide the reference model for connecting sources to each other. They also support the performance of data products in the long run, as they keep systems interoperable over time, for instance when new databases are added or supplementary features are needed.
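To make the idea concrete, here is a minimal Python sketch of how a shared terminology can connect two sources that name the same disease differently. Everything in it is hypothetical: the concept identifiers are placeholders rather than codes from a real ontology, and the records, field names, and helper functions are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical lookup from free-text disease labels to a shared concept
# identifier. The "CONCEPT:..." codes are placeholders, not real ontology IDs.
LABEL_TO_CONCEPT = {
    "heart attack": "CONCEPT:0001",
    "myocardial infarction": "CONCEPT:0001",
    "type 2 diabetes": "CONCEPT:0002",
    "t2dm": "CONCEPT:0002",
}


def normalise(label):
    """Return the shared concept identifier for a label, or None if unknown."""
    return LABEL_TO_CONCEPT.get(label.strip().lower())


def integrate(*sources):
    """Group records from several sources under their shared concept.

    Each source is a (name, records) pair. Records whose label cannot be
    mapped are returned separately so that an expert can review them.
    """
    merged = defaultdict(list)
    unmapped = []
    for source_name, records in sources:
        for record in records:
            concept = normalise(record["disease"])
            if concept is None:
                unmapped.append((source_name, record))
            else:
                merged[concept].append((source_name, record))
    return dict(merged), unmapped


# Invented example records from two imaginary sources.
trial_registry = [
    {"id": "T1", "disease": "Heart attack"},
    {"id": "T2", "disease": "T2DM"},
]
literature = [
    {"id": "P1", "disease": "Myocardial infarction"},
    {"id": "P2", "disease": "Rare syndrome X"},
]

merged, unmapped = integrate(
    ("trials", trial_registry),
    ("literature", literature),
)
```

In this sketch, the trial on "Heart attack" and the publication on "Myocardial infarction" end up under the same concept without manual matching, while the unmapped record is set aside for expert review. Adding a third source would only require mapping its labels to the same identifiers, which illustrates why a common reference model helps when new databases are added later.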
The content of biomedical databases central to pharma R&D is heterogeneous. To provide an overview of the open data landscape, we assessed twenty-one established resources in our white paper, Open Data in Biomedicine. We grouped those databases into five categories: genes & proteins, compounds & chemical information, biomedical literature, clinical trials, and terminologies & ontologies. In the present article, we describe “genes & proteins” as well as “compounds & chemical information” (see table below). The remaining three categories are the subject of next week’s post.
Genes & proteins
Databases used in the field of genomics and proteomics form the backbone of modern drug discovery. They are the product of sequencing technologies and make up a heterogeneous group. For instance, they cover nucleotide and protein sequences, descriptions of human phenotypes, and 3D structures of large biological molecules (see figure below).
Those datasets contain results submitted by research groups around the world and subsequently curated either by teams of experts or by automated procedures. Now that abundant data generation is assured, the main challenge consists in interpreting that constant flow. Doing so makes the discovery of new therapeutic targets possible, by combining an understanding of the relationships between genes and diseases with structural data.
Figure: Coverage of ENA, Ensembl, UniProt, RCSB PDB, and OMIM.
Compounds & chemical information
Chemical compounds and medicines are made available in two types of databases that complement one another. The first class covers drug-like molecules as well as medicines that are either still at the experimental stage or already approved. Those datasets describe many features of molecules, ranging from their nomenclature and structure to their targets and pharmacological properties. ChEMBL and DrugBank are well-known examples.
Open source databases of chemical compounds have become mature resources in biochemistry and drug discovery since the Therapeutic Target Database was released in 2002. The quality and quantity of their data have continuously improved from that point onwards, leading to their extensive use in early stages of drug discovery. Drawing upon large compound databases, virtual screening makes it possible to identify the candidate molecules most likely to exhibit biological activity at the investigated therapeutic targets.
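As a rough illustration of the principle, the Python sketch below ranks a small compound library by its similarity to a known active molecule. It assumes one particular flavour of virtual screening (ligand-based similarity search with the Tanimoto coefficient), which is only one of several approaches. The fingerprints are invented sets of feature indices and the compound names are hypothetical; a real project would compute fingerprints from actual structures with a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def screen(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented fingerprint of a molecule assumed to be active at the target.
known_active = frozenset({1, 4, 7, 9, 12})

# Invented library; each value is the set of features present in a compound.
library = {
    "cmpd_A": frozenset({1, 4, 7, 9, 15}),   # shares 4 of 6 distinct features
    "cmpd_B": frozenset({2, 3, 5}),          # shares no feature with the query
    "cmpd_C": frozenset({1, 4, 7, 9, 12}),   # identical to the query
}

hits = screen(known_active, library)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

With these made-up inputs, cmpd_C and cmpd_A pass the 0.5 threshold and cmpd_B is discarded. The threshold is an arbitrary choice for the example; in practice it would be tuned to the fingerprint type and the target under investigation.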
The second class of compound datasets lists medicines that underwent the registration process at the FDA or the EMA. It is based on regulatory documents related to the applications for marketing authorisation submitted by pharmaceutical companies to the regulatory agencies. This complements data on chemical compounds nicely, as it reflects the finishing line of drug development.