Structuring, connecting and discovering knowledge: Our biotech NLP platform helps to identify therapeutic targets from millions of scientific papers


In biomedicine, much of the information generated is made publicly available by and to the scientific community. These open data resources are attracting significant attention as the pharmaceutical industry comes under continuous pressure to speed up the costly process of drug discovery. The question of how to unlock open data’s potential for pharma R&D opens up new avenues for data science and analytics. idalab has accelerated early-stage drug discovery for several innovative biotech startups by using natural language processing to help identify therapeutic targets from millions of scientific papers.

Why is data ‘open’ in biomedicine?

The tradition of open data dates back to the 1980s, when bioinformatics was emerging, and open resources have coexisted with proprietary databases ever since. In addition, transparency policies adopted by regulatory agencies require the registration of clinical trials and the publication of their results. Together, technology, best practice, and legislation have bolstered the growth of open biomedical data on the Internet over the last decade. That expansion enhances the power of knowledge discovery in databases, which draws upon expertise in both pharma and data science.

Until recently, biomedical databases were largely used by molecular biologists, clinicians, and students for research purposes. This is changing. Biotech companies increasingly rely on open data to support their processes as they take on a growing share of early drug discovery activities. In this context, idalab was first approached by a US-based biotech startup looking for a way to make systematic use of this wealth of information, which is scattered across a variety of databases.

Building a platform: first connect sources, then search

Because information such as nucleotide sequences, protein structures, clinical data, and scientific papers resides in separate silos, knowledge discovery requires the meaningful integration of heterogeneous data spread across different databases. In practice, this means building a platform. A platform that includes all relevant databases, as well as specific modules for targeted search, is the foundation of actionable knowledge discovery. While some databases already include cross-references to related material in other registries, advanced matching techniques enable meaningful connections where those links do not yet exist. This is where machine learning algorithms and natural language processing have a pivotal role to play.
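To make the idea of matching records without existing cross-references more concrete, here is a minimal sketch in Python. It is not idalab's actual method: it assumes a simple string-similarity approach built on the standard library's difflib, and the record IDs, names, and the 0.85 threshold are all hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop punctuation/whitespace so spelling variants compare closely."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def link_records(source: dict, target: dict, threshold: float = 0.85) -> list:
    """For each source record, return its best-scoring target record above the threshold.

    Both arguments map a record ID to a free-text name. The threshold is an
    assumed value for illustration, not a recommended setting.
    """
    links = []
    for s_id, s_name in source.items():
        best_id, best_score = None, 0.0
        for t_id, t_name in target.items():
            score = SequenceMatcher(None, normalize(s_name), normalize(t_name)).ratio()
            if score > best_score:
                best_id, best_score = t_id, score
        if best_id is not None and best_score >= threshold:
            links.append((s_id, best_id, round(best_score, 2)))
    return links


# Hypothetical records from two unconnected sources.
protein_db = {"P1": "Tumor necrosis factor", "P2": "Interleukin-6"}
literature_terms = {"T1": "tumour necrosis factor", "T2": "interleukin 6", "T3": "insulin"}

links = link_records(protein_db, literature_terms)
for source_id, target_id, score in links:
    print(source_id, "->", target_id, score)
```

A production system would more likely rely on curated synonym lists, ontologies, or learned models than on raw string similarity, but the sketch shows the basic shape of the task: normalise names, score candidate pairs, and keep only confident links.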

Generating insights from the literature – on a very large scale

The biomedical literature in particular is a major resource, as millions of recent and past publications hold valuable information on potential scientific breakthroughs. To detect those nuggets, it is essential to link the data found in the literature to other resources, which is done using natural language processing (NLP). NLP extracts information by turning unstructured text into structured data, paving the way for useful cross-references. By enabling the meaningful selection and integration of data, NLP sits at the interface of biomedicine and data science. Customised to the biomedical corpus, it is the backbone of our platform.
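As a rough illustration of what turning unstructured text into structured data can look like, the sketch below counts gene and disease mentions that appear together in the same sentence. It is a deliberately simple dictionary lookup, not the approach used in our platform: the lexicons, the example abstracts, and the function name extract_pairs are all hypothetical, and real biomedical NLP would need proper entity recognition and synonym handling.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-lexicons; a real system would draw on curated terminologies.
GENES = {"TP53", "BRCA1", "EGFR"}
DISEASES = {"breast cancer", "lung cancer"}


def extract_pairs(abstracts):
    """Count (gene, disease) pairs that co-occur within a single sentence."""
    counts = Counter()
    for text in abstracts:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            genes = {g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)}
            lowered = sentence.lower()
            diseases = {d for d in DISEASES if d in lowered}
            counts.update(product(sorted(genes), sorted(diseases)))
    return counts


# Made-up example sentences, not taken from real publications.
abstracts = [
    "BRCA1 mutations are frequent in breast cancer. EGFR is a known driver in lung cancer.",
    "EGFR inhibitors improved outcomes in lung cancer patients.",
    "TP53 was sequenced in healthy controls.",
]

pairs = extract_pairs(abstracts)
for (gene, disease), n in pairs.most_common():
    print(gene, disease, n)
```

Each resulting row (gene, disease, count) is structured data that can be joined to other resources, for example a gene or clinical trial database, which is the kind of cross-reference described above.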

By tailoring our NLP platform to our clients’ needs, we can dramatically accelerate their target identification process.

Further reading

Our white paper provides an overview of established data sources based on our methodology for assessing biomedical datasets. Among the data relevant to pharmaceutical R&D, we focus on five key categories: genes & proteins, compounds & chemical information, biomedical literature, clinical trials, and terminologies & ontologies.

Gillian Hertlein

Associate Consultant

+49 (0) 176 38 55 48 16


Potsdamer Straße 68
10785 Berlin