The Power of Open Data for Pharmaceutical R&D


Biomedicine is a central driving force behind the rise of big data. High-throughput screening and the increase in computing power have led to the generation of vast amounts of data, opening new avenues for analytics. As a result, the omics revolution is bringing researchers closer to understanding the underlying mechanisms of disease.


Unlike in other fields and industries, much of the information generated is made publicly available by and to the scientific community. This tradition of open data dates back to the 1980s, when bioinformatics was emerging, and has coexisted with proprietary databases ever since. Furthermore, transparency policies adopted by regulatory agencies require the registration of clinical trials and the publication of their results. Together, technology, best practice, and legislation have bolstered the growth of open biomedical data on the Internet over the last decade. That expansion enhances the power of knowledge discovery in databases, which draws upon expertise in both pharma and data science.

Open data resources are attracting significant attention as the pharmaceutical industry faces mounting pressure from growing drug development costs, decreasing R&D productivity, and increased demands from payers to show value for money. Our newly published white paper provides an overview of established data sources based on our methodology for assessing biomedical datasets. Until recently, those databases were used largely by molecular biologists, clinicians, and students for personal and research purposes. This is changing. Biotech companies increasingly rely on open data to support their processes as they take on a growing share of early drug discovery activities.

Among the data relevant to pharmaceutical R&D, we focus on five key categories: genes & proteins, compounds & chemical information, biomedical literature, clinical trials, and terminologies & ontologies. We specifically examine data compiled in databases that are accessible via web interfaces and stored in a machine-readable format. This setup allows users to automate parts of their analyses and, most importantly, to integrate data purposefully along the entire value chain. Indeed, several databases include cross-references to related material in other registries. Where those links do not yet exist, matching techniques from data science can establish meaningful connections.
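
To make the linking idea concrete, here is a minimal sketch in Python of how records from two sources might be connected. It is an illustration under stated assumptions, not the method described in our white paper: the record layout (`id`, `name`, `xref`), the helper names, the toy records, and the similarity threshold of 0.85 are all hypothetical. The sketch uses an explicit cross-reference when one exists and otherwise falls back to simple fuzzy name matching with the standard library's `difflib`, which stands in here for the more advanced matching techniques a real project would use.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and keep only letters and digits so spelling variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(source, target, threshold=0.85):
    """Link each source record to at most one target record.

    An explicit cross-reference wins when it points at a known target id.
    Otherwise the best fuzzy name match is accepted if it reaches the
    (hypothetical) threshold. Returns (source_id, target_id, method) tuples.
    """
    target_by_id = {rec["id"]: rec for rec in target}
    links = []
    for rec in source:
        xref = rec.get("xref")
        if xref in target_by_id:
            links.append((rec["id"], xref, "xref"))
            continue
        best = max(
            target,
            key=lambda t: similarity(rec["name"], t["name"]),
            default=None,
        )
        if best is not None and similarity(rec["name"], best["name"]) >= threshold:
            links.append((rec["id"], best["id"], "fuzzy"))
        else:
            links.append((rec["id"], None, "unmatched"))
    return links


# Toy records with made-up identifiers, for illustration only.
compounds = [
    {"id": "C1", "name": "Acetylsalicylic acid", "xref": "T9"},
    {"id": "C2", "name": "Ibuprofen", "xref": None},
    {"id": "C3", "name": "Example compound X", "xref": None},
]
trials = [
    {"id": "T9", "name": "acetylsalicylic-acid"},
    {"id": "T7", "name": "IBUPROFEN"},
]

links = link_records(compounds, trials)
for source_id, target_id, method in links:
    print(source_id, target_id, method)
```

In practice, plain string similarity is rarely sufficient for biomedical names, which is why curated terminologies and ontologies matter: synonyms and identifiers from those resources usually give more reliable links than spelling alone.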

This is where machine learning algorithms and natural language processing have a pivotal role to play. They support early drug discovery and clinical development, thereby helping pharmaceutical companies to produce safe and effective medicines.

To read more about the open data landscape in biomedicine, download our white paper.

Luis Dreisbach

Associate (Data Science)

+49 (0) 162 23 74 359


Potsdamer Straße 68
10785 Berlin