Data Science for Pharma – A Short Case Study


Open data in biomedicine is a gold mine that can strengthen innovation in pharmaceutical R&D. In combination with the right analytics, public data helps identify therapeutic targets and ligands, enhance clinical development, and make portfolio management more efficient. The challenge is to integrate, with a clear purpose, abundant and heterogeneous data scattered across many sources.

To make the value of connecting those datasets concrete, we present a practical example of knowledge extraction from the scientific literature. In this short case study, we associate each precancerous indication with all drugs investigated against it. The analysis also highlights potential therapeutic targets whose 3D structures have been determined. We describe the setup of this illustrative analysis in three steps.

Step 1. Detect drugs and precancerous indications in unstructured text

The scientific literature contains the most up-to-date information on clinical and preclinical studies in the form of unstructured text. The abstracts of biomedical articles relevant to the analysis are retrieved by querying PubMed with the MeSH heading “Precancerous Conditions” (unique ID D011230). That corpus is then mined using natural language processing (NLP) applied to the publications’ titles, abstracts, and metadata. Drugs are extracted using the DrugBank Vocabulary, which lists original drug names and their synonyms. Disease entities are selected from the terms beneath “Precancerous Conditions” in the hierarchical MeSH tree.
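As a minimal sketch of the retrieval step, the snippet below builds a search request for the NCBI E-utilities ESearch endpoint. Using E-utilities at all is an assumption, since the text only states that PubMed is queried with the MeSH heading. The helper name `build_esearch_url` and the `retmax` default are hypothetical. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed access route to PubMed).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_heading: str, retmax: int = 100) -> str:
    """Build an ESearch URL listing PubMed IDs indexed under a MeSH heading."""
    params = {
        "db": "pubmed",
        # The [MeSH Terms] field tag restricts the search to MeSH indexing.
        "term": f'"{mesh_heading}"[MeSH Terms]',
        "retmode": "json",
        "retmax": retmax,  # hypothetical page size for this illustration
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Precancerous Conditions")
print(url)
```

The returned PubMed IDs could then be passed to a second request that fetches titles and abstracts for the mining step.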

Named entity recognition. Precancerous indications and drugs are detected in scientific abstracts.
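The dictionary-based detection described in Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the case study: the two lexicons are tiny hand-written stand-ins for the DrugBank Vocabulary and the MeSH subtree, the example sentence is invented, and the names `Entity`, `compile_lexicon`, and `find_entities` are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    kind: str       # "drug" or "disease"
    canonical: str  # preferred name the matched synonym maps to


# Toy lexicons (assumption): lowercase synonym -> canonical name.
# Real lexicons would be loaded from DrugBank Vocabulary and MeSH.
DRUG_SYNONYMS = {
    "aspirin": "Acetylsalicylic acid",
    "acetylsalicylic acid": "Acetylsalicylic acid",
    "metformin": "Metformin",
}
DISEASE_TERMS = {
    "barrett esophagus": "Barrett Esophagus",
    "barrett's esophagus": "Barrett Esophagus",
}


def compile_lexicon(lexicon: Dict[str, str]) -> "re.Pattern[str]":
    """Compile one case-insensitive pattern covering every synonym."""
    # Longest synonyms first so multi-word names win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def find_entities(text: str, lexicon: Dict[str, str], kind: str) -> List[Entity]:
    """Return every lexicon match in the text with its canonical name."""
    pattern = compile_lexicon(lexicon)
    return [
        Entity(m.start(), m.end(), kind, lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Invented example sentence, used only to exercise the matcher.
text = "We review trials of aspirin and metformin in patients with Barrett's esophagus."
drugs = find_entities(text, DRUG_SYNONYMS, "drug")
diseases = find_entities(text, DISEASE_TERMS, "disease")
print([e.canonical for e in drugs], [e.canonical for e in diseases])
```

Mapping every synonym to one canonical name is what allows mentions of the same drug under different names to be counted together in the later steps.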

Step 2. Assemble drug-disease pairs and assess their connection strength

Drug-disease pairs are uncovered using relationship extraction algorithms customised to the biomedical corpus. This NLP approach draws on features such as the co-occurrence of entities in a sentence and the distance between them. The strength of each drug-disease pair is then scored with weighted counts based on the number of articles in which the pair is found, how often it appears in those articles, the journal’s impact factor, the publication type, the number of citations, and similar indicators. Since abstracts contain only a limited amount of information, the results of NLP can be improved significantly by incorporating the full-text articles available in PubMed Central.


Relationship extraction. Precancerous indications are paired with the relevant drugs, and the strength of those relationships is weighted.
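To illustrate how sentence-level co-occurrence, entity distance, and article-level weights might combine into one score, here is a simplified sketch. It is a heuristic stand-in for the customised relationship extraction algorithms mentioned above, not a reproduction of them. The `1 / (1 + distance)` discount, the single `weight` per article (standing in for journal, publication type, and citations), and all entity and article names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass
class Mention:
    name: str      # canonical entity name
    kind: str      # "drug" or "disease"
    position: int  # token index of the mention within its sentence


@dataclass
class Article:
    pmid: str
    sentences: List[List[Mention]]  # entity mentions grouped by sentence
    weight: float = 1.0             # hypothetical article-level weight


def score_pairs(articles: List[Article]) -> Dict[Tuple[str, str], float]:
    """Sum distance-discounted sentence co-occurrences, weighted per article."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for article in articles:
        for mentions in article.sentences:
            drugs = [m for m in mentions if m.kind == "drug"]
            diseases = [m for m in mentions if m.kind == "disease"]
            for drug, disease in product(drugs, diseases):
                distance = abs(drug.position - disease.position)
                # Closer mentions contribute more (assumed discount).
                scores[(drug.name, disease.name)] += article.weight / (1 + distance)
    return dict(scores)


# Placeholder corpus: names and weights are invented for the example.
articles = [
    Article("PMID_1", [[Mention("drug_A", "drug", 0),
                        Mention("disease_X", "disease", 3)]], weight=2.0),
    Article("PMID_2", [[Mention("drug_A", "drug", 1),
                        Mention("disease_X", "disease", 2)],
                       [Mention("drug_B", "drug", 0),
                        Mention("disease_X", "disease", 1)]], weight=1.0),
]
scores = score_pairs(articles)
print(scores)
```

Because each co-occurrence adds to the total, a pair mentioned repeatedly or across several articles accumulates a higher score, which mirrors the weighted counts described in the text.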

Step 3. Rank drug-disease pairs, enriched with therapeutic targets, for expert assessment

Ultimately, the results of the analysis are delivered as a list ranking the drug-disease pairs according to their relationship strength. Each pair is also associated with the articles in which it appears so that experts can navigate and review the evidence with ease. In addition, PubMed can be linked to RCSB PDB in a deterministic fashion to extract the target proteins that are examined in those same articles and whose structures have been determined experimentally. To identify a more comprehensive set of targets, UniProt can be brought into play to bridge PubMed and RCSB PDB using text mining. As a result of the three steps, unstructured, scattered data is transformed into actionable knowledge for domain experts.

Knowledge discovery. Unstructured text information is turned into a ranked list. This structured format displays the drug-disease pairs, their relationship strength, the source articles, and relevant protein structures.
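The ranked list can be pictured with a small sketch like the one below. It assumes that the relationship strengths, supporting article identifiers, and structure identifiers have already been collected in the earlier steps. The record layout, the function names, and every identifier shown are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PairRecord:
    drug: str
    disease: str
    strength: float
    pmids: List[str] = field(default_factory=list)    # supporting articles
    pdb_ids: List[str] = field(default_factory=list)  # linked structures


def rank_pairs(records: List[PairRecord]) -> List[PairRecord]:
    """Sort by descending strength; break ties alphabetically."""
    return sorted(records, key=lambda r: (-r.strength, r.drug, r.disease))


def format_row(rank: int, record: PairRecord) -> str:
    """Render one line of the ranked list for expert review."""
    structures = ", ".join(record.pdb_ids) or "none"
    return (
        f"{rank}. {record.drug} / {record.disease} | "
        f"strength={record.strength:.2f} | "
        f"articles={len(record.pmids)} | structures={structures}"
    )


# Placeholder records: all names and identifiers are invented.
records = [
    PairRecord("drug_B", "disease_X", 0.5, ["PMID_2"]),
    PairRecord("drug_A", "disease_X", 1.0, ["PMID_1", "PMID_2"], ["PDB_1"]),
    PairRecord("drug_A", "disease_Y", 0.5, ["PMID_3"], ["PDB_2"]),
]
ranked = rank_pairs(records)
rows = [format_row(i, r) for i, r in enumerate(ranked, start=1)]
print("\n".join(rows))
```

Keeping the article identifiers on each record is what lets a reviewer move from a ranked pair back to the underlying evidence.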

Actionable knowledge discovery

To harness the value of biomedical databases, it is essential to navigate the maze of available information with a well-defined case to solve. This enables the meaningful selection and integration of data, which constitutes the foundation of actionable knowledge discovery. Natural language processing is at the core of this process and thus at the interface of biomedicine and data science. Customised to the biomedical corpus, it is the centrepiece of next-generation information processing engines.



Benjamin Häusler

