BioBERT: the new NLP model giving drug development teams a boost

With more than a million biomedical research papers published each year, no human researcher can keep track of everything that’s relevant to their field. Are specially trained NLP models the answer? 

Jonas Nowottnick comments on: Jinhyuk Lee, Wonjin Yoon et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining", Bioinformatics, 2019, 1–7

The challenge

More than a million scientific biomedical publications are listed each year on PubMed. How could any drug development researcher have time to scan them for the information that’s relevant to their work?

As a result, knowledge that could be vital to understanding a disease goes unnoticed, and the work that depends on it suffers: it may be less comprehensive, and less useful to others, than it could be. There is therefore a pressing need to process these texts automatically – to broaden the knowledge base of drug development researchers and streamline their efforts.

However, general language representation models struggle with biomedical language. Trained only on general-domain corpora such as Wikipedia or BooksCorpus, they perform markedly worse when the language becomes technical and domain-specific, as it is in the biomedical field.

Summary

The authors investigate how the recently introduced pre-trained natural language processing model BERT can be adapted for biomedical corpora. They introduce BioBERT (Bi-directional Encoder Representations from Transformers for Biomedical Text Mining), a domain-specific language representation model pre-trained on large-scale biomedical corpora.

According to the authors, BioBERT largely outperforms BERT and previous state-of-the-art language representation models in a variety of biomedical text-mining tasks, such as:

  • biomedical named entity recognition
  • relation extraction (such as identifying the relationship between a gene and a disease, or a protein and a small molecule)
  • question answering (such as identifying what acronyms stand for and, therefore, their meaning and relevance)
Fig.1: BioBERT is capable of solving tasks such as named entity recognition, relation extraction and question answering

How it works

The authors took the state-of-the-art BERT model and augmented it for biomedical domain tasks.

BERT is a contextualised language representation model: a bi-directional transformer pre-trained with a masked-language-modelling objective, in which randomly hidden words must be predicted from their surrounding context.

Fig.3: Forward, backward and bi-directional learning approaches

Previous models learned either in a forward (left-to-right) or backward (right-to-left) direction – or combined the two separately. The masked-language-modelling approach used to train BERT is the first truly bi-directional one. It gives BERT a better “understanding” of context than previous language models and therefore improves performance.
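To make the masked-language-modelling idea concrete, here is a minimal sketch using the Hugging Face transformers library and the general-domain bert-base-uncased checkpoint (not BioBERT itself): the model fills in the hidden word using the words on both sides of it.

```python
# Minimal sketch of the masked-language-model idea (general-domain BERT, not BioBERT):
# the model predicts the [MASK] token from the context to its left AND its right.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The patient was treated with [MASK] to lower blood pressure."):
    print(candidate["token_str"], round(candidate["score"], 3))
```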

Whereas BERT was pre-trained on English Wikipedia and BooksCorpus, BioBERT starts from those weights and continues pre-training on biomedical text. Training was resource-intensive: eight 32GB NVIDIA V100 GPUs were fed PubMed abstracts and PubMed Central full-text articles for 23 days, taking in a total of around 18bn words.
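The sketch below illustrates this idea of continued (domain-adaptive) pre-training with the Hugging Face transformers library. It is not the authors' training code: the two sentences are a toy stand-in for the PubMed/PMC corpora, and the checkpoint name and hyperparameters are illustrative assumptions.

```python
# Sketch of BioBERT-style pre-training: start from general-domain BERT weights and
# continue masked-language-model training on biomedical text. Toy corpus, toy settings.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")  # general-domain starting point

# Toy "corpus" standing in for millions of PubMed abstracts and PMC full-text articles.
corpus = [
    "Dexamethasone reduced mortality in patients receiving respiratory support.",
    "Mutations in the BRCA1 gene are associated with hereditary breast cancer.",
]
train_dataset = [tokenizer(text, truncation=True, max_length=128) for text in corpus]

# Randomly mask 15% of tokens; the model must predict them from both directions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biobert-sketch", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()  # in the paper, this step ran for 23 days on eight V100 GPUs
```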

Could this work in practice?

The BioBERT model is open source and its pre-trained weights are publicly available. However, to perform tasks such as named entity recognition, relation extraction or question answering, it must first be fine-tuned. This is done by adding a task-specific layer, trained on a task-specific labelled dataset, to process BioBERT’s output. In this way, a single pre-trained model stays versatile and streamlined across biomedical tasks.
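As an illustration, here is a hedged sketch of adding such a task-specific layer with the Hugging Face transformers library. The checkpoint name is assumed to be the publicly released BioBERT weights on the Hugging Face hub (adjust if your copy is named differently), and the label set is purely illustrative rather than taken from the paper.

```python
# Sketch: put a task-specific token-classification head (for NER) on top of BioBERT.
# "dmis-lab/biobert-base-cased-v1.1" is assumed to be the released BioBERT weights;
# the label set (e.g. B-Disease / I-Disease / O) is illustrative, not from the paper.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels creates a randomly initialised classification layer over BioBERT's output.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

# From here, the new layer (and usually the whole model) is fine-tuned on a labelled
# NER dataset, e.g. with transformers' Trainer, before it produces useful predictions.
```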

The paper reports strong – in several cases state-of-the-art – results on biomedical text-mining benchmarks, and the fine-tuning recipe makes it straightforward to apply one pre-trained model to multiple tasks. In a field forever inundated with new research information, the application of a model such as this is hugely important to the advancement of drug development.

Open questions

Could BERT be trained to operate successfully in other fields? Certainly, the more distinctive the language of a domain, and the greater the volume of text available, the higher the chance that domain-specific models will improve natural language processing performance and produce useful results. Besides biomedicine, clinical notes and patent texts fall into this category.

Contact

Jonas Nowottnick

Student Trainee, Data Science

E-Mail
jonas.nowottnick@idalab.de

Address
Potsdamer Straße 68
10785 Berlin