Navigating the Biomedical Data Landscape – Part II
Data science is revolutionizing pharmaceutical R&D, and knowing how to navigate the open data landscape is key to that revolution. The databases that underpin pharma R&D stem from various disciplines, such as genomics, structural biology, and clinical development. Understanding the heterogeneity of biomedical data is therefore crucial for meaningful data integration. Last week, we described the first two categories of resources covered in our white paper, Open Data in Biomedicine. This post presents the remaining three major types of sources: the scientific literature, clinical trial registries, and biomedical ontologies (see table below).
The scientific literature mirrors biomedical research worldwide and contains highly valuable information. That information is, however, locked in unstructured text. PubMed is the best-known resource referencing abstracts of biomedical articles. First released in 1996, this website gives access to MEDLINE, NLM’s prime bibliographic database. In addition, free access to full-text articles has been provided via PubMed Central since 2000. Most of its entries correspond to abstracts listed in PubMed. The European open access platform for the biomedical literature was launched in 2007 and later renamed Europe PMC. It incorporates resources from PubMed and PubMed Central, thereby gathering abstracts and full-text papers on one platform. As a member of the PMC International network of digital archives, Europe PMC also aims to preserve access to the biomedical literature.
The Cochrane Database of Systematic Reviews (CDSR) is another useful source of scientific publications. It includes over 7,000 review articles written by specialists who look back at the recent advances in their field. Those thorough summaries are the product of a collaborative effort and reflect the current consensus. Systematic reviews thus contain information of high quality, but result from a slow process and are limited in scope. In contrast, the literature in its entirety also holds data on potential breakthroughs, the hidden gold nuggets. Text mining enables the automated extraction of that valuable information scattered across millions of papers. By turning unstructured text into structured data ready for analysis, text mining helps discover knowledge from the research pursued at biomedicine’s forefront.
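To make the idea of turning unstructured text into structured data more concrete, here is a minimal sketch of dictionary-based entity extraction, one of the simplest text-mining approaches. It is an illustration only, not a description of any particular tool: the lexicon, the function names `extract_entities` and `co_mentions`, and the example sentence are all assumptions made up for this post. Real pipelines rely on far larger vocabularies and on statistical or neural models.

```python
import re

# Hypothetical mini-lexicon: surface form -> (entity type, canonical name).
LEXICON = {
    "imatinib": ("drug", "imatinib"),
    "gleevec": ("drug", "imatinib"),
    "chronic myeloid leukemia": ("disease", "chronic myeloid leukemia"),
    "cml": ("disease", "chronic myeloid leukemia"),
}

# One case-insensitive pattern; longer terms come first so that
# multi-word terms are preferred over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_entities(text):
    """Return one structured record per lexicon term found in the text."""
    records = []
    for match in _PATTERN.finditer(text):
        entity_type, canonical = LEXICON[match.group(1).lower()]
        records.append(
            {
                "type": entity_type,
                "canonical": canonical,
                "mention": match.group(1),
                "start": match.start(),
            }
        )
    return records


def co_mentions(text):
    """Pair every drug with every disease mentioned in the same text."""
    found = extract_entities(text)
    drugs = {r["canonical"] for r in found if r["type"] == "drug"}
    diseases = {r["canonical"] for r in found if r["type"] == "disease"}
    return sorted((drug, disease) for drug in drugs for disease in diseases)


# Made-up example sentence standing in for an abstract.
sentence = "Gleevec showed durable responses in CML patients."
pairs = co_mentions(sentence)
```

Applied to the example sentence, `co_mentions` maps the brand name and the abbreviation to their canonical names and returns a single drug-disease pair. Co-mention is only a weak signal of a relationship, which is why such records are usually a starting point for further analysis rather than a result in themselves.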
Clinical trial registries are structured data sources on completed and ongoing clinical studies (see figure below). ClinicalTrials.gov focuses on the United States and is the world’s largest national database. Made public in 2000, this resource was born from the FDA Modernization Act of 1997 as an information base for trials with at least one site in the US or subject to FDA regulation, as well as for compounds manufactured in the US. Later amendments expanded the range of information required from study sponsors, including the communication of trial results. The EU Clinical Trials Register is the corresponding platform for promoting transparency in clinical development within the European Union. This website was launched in 2011 and provides access to the EudraCT database, a registry that has systematically recorded studies since 2004 and that also includes a results section.
The International Clinical Trials Registry Platform (ICTRP) is a central database that gathers information from several national registries approved by the WHO, including ClinicalTrials.gov and the EU Clinical Trials Register. Launched in 2007, this initiative gives the broadest overview of studies worldwide. Furthermore, since trials are often run in multiple countries and may therefore be registered more than once, the ICTRP strives to unambiguously identify clinical studies registered in various locations. Overall, clinical trial registries are valuable assets for detecting in-licensing opportunities, assessing the competitive landscape, or optimising study design.
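The problem of identifying one study registered in several places can be sketched in a few lines. The example below is not the ICTRP's actual procedure. It simply assumes that each registry record lists its own registration number plus any secondary identifiers, and it groups records that share at least one identifier. The function name `link_trials`, the record layout, and all identifiers shown are hypothetical placeholders.

```python
from collections import defaultdict


def link_trials(records):
    """Group registry records that share at least one identifier.

    Each record is a dict with a 'registry' name and a list of 'ids'
    (its own registration number plus any secondary IDs it declares).
    Returns a list of groups, each a sorted list of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the representative.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}
    for idx, rec in enumerate(records):
        for raw in rec["ids"]:
            key = raw.strip().upper()  # tolerate case and stray spaces
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Placeholder records; the identifiers are invented for illustration.
records = [
    {"registry": "ClinicalTrials.gov", "ids": ["NCT00000001", "2004-000001-10"]},
    {"registry": "EU Clinical Trials Register", "ids": ["2004-000001-10"]},
    {"registry": "ClinicalTrials.gov", "ids": ["NCT00000002"]},
    {"registry": "Other registry", "ids": ["nct00000002 "]},
]
linked = link_trials(records)
```

Here the first two records end up in one group because they share a secondary identifier, and the last two because their identifiers match after normalisation. In practice, secondary identifiers are often missing or mistyped, so matching on titles, sponsors, and dates is usually needed as well.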
Figure: Coverage of ClinicalTrials.gov, the International Clinical Trials Registry Platform (ICTRP), and the EU Clinical Trials Register.
Terminologies & ontologies
Data integration is key to leveraging the information present in multiple sources. Ontologies and terminologies are used to connect databases and to achieve consistency despite the particularities of each individual dataset. Not only do they catalogue the terms and concepts employed in a subject field, they also determine the hierarchy among entities and the relationships between them. Some classifications cover medical terms in general, while others focus more specifically on molecular entities or human diseases. Ontologies are updated regularly to reflect discoveries and recent changes in biomedicine. They mirror the conceptualisation of an area of knowledge and, therefore, facilitate communication amongst scientists. Most importantly, ontologies make information systems interoperable: records that use different names for the same concept can be mapped to a shared identifier and then queried together.
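A toy example shows how an ontology's catalogue of terms and its hierarchy can both be put to work. The concept identifiers (`EX:0001` and so on), the labels, and the helper functions `resolve`, `ancestors`, and `is_a` are invented for this sketch and do not come from any published ontology. The point is only to illustrate how synonyms collapse onto one concept and how hierarchical queries follow from the "is a" relationships.

```python
# Toy ontology: concept ID -> preferred label, synonyms, and is_a parents.
# All identifiers are hypothetical.
ONTOLOGY = {
    "EX:0001": {"label": "disease", "synonyms": [], "is_a": []},
    "EX:0002": {
        "label": "neoplasm",
        "synonyms": ["tumor", "tumour"],
        "is_a": ["EX:0001"],
    },
    "EX:0003": {
        "label": "leukemia",
        "synonyms": ["leukaemia"],
        "is_a": ["EX:0002"],
    },
    "EX:0004": {
        "label": "chronic myeloid leukemia",
        "synonyms": ["CML"],
        "is_a": ["EX:0003"],
    },
}

# Reverse index from lower-cased label or synonym to concept ID.
INDEX = {}
for concept_id, concept in ONTOLOGY.items():
    for name in [concept["label"], *concept["synonyms"]]:
        INDEX[name.lower()] = concept_id


def resolve(term):
    """Map a free-text term to a concept ID, or None if it is unknown."""
    return INDEX.get(term.strip().lower())


def ancestors(concept_id):
    """Return the IDs of all concepts reachable through is_a links."""
    seen = set()
    stack = list(ONTOLOGY[concept_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ONTOLOGY[parent]["is_a"])
    return seen


def is_a(concept_id, ancestor_id):
    """True if concept_id equals ancestor_id or descends from it."""
    return concept_id == ancestor_id or ancestor_id in ancestors(concept_id)


# Two datasets that name the same disease differently.
dataset_a_term = "CML"
dataset_b_term = "Chronic Myeloid Leukemia"
same_concept = resolve(dataset_a_term) == resolve(dataset_b_term)
```

Once both datasets are annotated with concept IDs rather than free-text labels, a query for all records about neoplasms can use `is_a` to include the more specific leukemia entries automatically.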