No-nonsense: How ChatGPT Technology Helps Biotechs Find Better Drug Targets Faster

ChatGPT for drug target identification?! What sounds like madness contains the kernel of a real game changer. If applied properly and grounded in scientific knowledge, the underlying technology can sift through millions of publications, finding highly specific pieces of evidence to identify the ideal target.

With few shots on goal, choosing the right targets can be a make-or-break decision for biotechs. In particular for platform biotechs aiming to scale a modality or technology, tools to navigate the target space are a crucial part of the discovery engine.

Given the ever-growing mountain of publications, R&D teams trying to focus on a set of targets often face hundreds or thousands of publications to assess for each target - too many to read manually. Approaches to narrow the space of possible targets often rely on the research team’s expertise and intuition, as well as external expert advice. This approach is time-consuming and potentially biased, but more importantly, it is likely to miss novel opportunities outside the team’s experience.

Data-driven tools for target prioritization, such as pre-filtering based on database information or keyword searches in scientific publications, have their uses but appear blunted in light of the complexity of biology. They can help drive an initial crawl for potentially relevant publications. But they cannot identify the complex semantic relationships that describe connections between biological entities, which are often additionally qualified by a strength of evidence.

For example:

[<target> upregulates <B> in the case of <C>, and we know this from an experiment of type <D>]

As anyone who has ever read an academic publication knows, there is a combinatorial explosion of ways in which such information can be expressed in human language, complicated further by synonyms and the complex hierarchical ontologies used for biological entities.

To parse scientific literature in a way that can help research teams prioritize targets for a specific R&D strategy, large language models come to the rescue.

How to make ChatGPT technology useful for target prioritization

By helping scientists triage potential targets much faster and discover novel types of opportunities, integrating LLMs into a biotech’s discovery engine can become a source of true competitive advantage.

At the beginning of 2023, Large Language Models (LLMs) became incredibly popular through ChatGPT, a chat-like interface. In simple terms, ChatGPT answers a user-provided question (called a prompt) with the most likely words that follow, using a statistical model (a deep neural network) trained on vast amounts of human-written text.

As this “next-word-prediction” approach has proved (embarrassingly) useful for many everyday tasks, and ChatGPT appears knowledgeable about just about everything, it has been heralded as the solution to all sorts of problems. Like many supposed silver bullets, it fails to hit its target.

Asking ChatGPT straight up for “the top ten novel targets which can cure cancer” will not yield anything useful. First, hallucination: LLMs are prone to generating plausible-sounding sequences of words that make no sense, e.g. making up entire publications that have never been written. Perhaps harmless when you are researching banana bread recipes, but useless in drug discovery. Second, ChatGPT has no access to proprietary information, such as patent databases or full-text publications, which are often crucial in researching the biological function of potential drug targets.

However, the underlying LLMs are the pivotal component of next-generation target prioritization systems. Imagine you are looking for targets that open up new routes of cancer immunotherapy. Instead of prompting an LLM directly to deliver “new cancer immunotherapy targets”, the key idea is to feed it portions of relevant text (e.g. abstracts identified by keyword search) and then use it to identify the complex semantic patterns within that text. For example, one could search for factors which influence the intracellular trafficking of the immune checkpoint protein Programmed Death-Ligand 1 (PD-L1), a key player in cancer immunosuppression. By grounding LLMs in relevant pieces of scientific literature, you not only get reliable (non-hallucinated) results, but also a link to the underlying evidence in the original paper, which scientists can use to validate and dig deeper.

The accuracy that evidence-grounded LLMs achieve is remarkable. For example, given the following portion of text taken from a scientific publication, such a system can correctly identify the relevance of an intracellular receptor for PD-L1 trafficking and expression:

"Elevated receptor X levels correlated with normal levels of PD-L1 in prostate cancer cells, consistent with a compensatory role for this receptor in endosomal targeting."

Clearly, making that inference requires an expert to read the text very carefully, and it is far beyond what any form of clever keyword searching could achieve.

Unlike humans, LLMs love sifting through millions of publications day and night (for very little money). One way the end result of an LLM-driven target prioritization campaign can look is therefore a simple table, pictured below: each row is a gene, the columns correspond to the specific properties one is looking for, and each cell shows the strength of evidence and links to the underlying source.

Table showing results of an LLM-based drug target prioritization tool

Once you have condensed thousands of publications into a table, you can apply connected tools to help you sort, filter and score your way through the target space. Moreover, as this process can be fully automated, it can be repeated regularly, or set up to raise alerts whenever interesting new evidence appears.
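Under the hood, such a table is just a pile of structured extractions aggregated per gene. A toy sketch of how the scoring and sorting might work; the gene names, the three-level strength scale, and the row layout are all hypothetical:

```python
# Hypothetical three-level scale for LLM-extracted evidence strength.
STRENGTH = {"weak": 1, "moderate": 2, "strong": 3}


def rank_targets(extractions: list[dict]) -> list[tuple[str, int]]:
    """Aggregate rows of {gene, property, strength, source} into
    (gene, total_score) pairs, strongest overall evidence first."""
    scores: dict[str, int] = {}
    for row in extractions:
        scores[row["gene"]] = scores.get(row["gene"], 0) + STRENGTH[row["strength"]]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A real system would keep the per-cell sources alongside the scores so that every number in the table stays one click away from its evidence.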

How to build an LLM-driven Target Prioritization Engine

The real power of this approach lies in tailoring the input and output of LLMs to the specific R&D strategy of a biotech company.

In our work with biotech clients, we have found that the key to success in building such a system is to work closely with the scientists, iterating over a well-defined benchmark set of publications until the accuracy is sufficient for roll-out. The key parameters to tune are the quality criteria for the publications to consider (authors, journals, impact factors, etc.), the specific ways to interrogate the LLM, the handling of sophisticated ontological mappings, and the output format, which can be integrated into the discovery process.
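That iteration loop needs a simple yardstick: compare the relations the system extracts from the benchmark publications against expert-labelled ground truth. A minimal sketch, where the relation triples are purely illustrative:

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Score extracted relations against an expert-labelled benchmark.

    precision: fraction of extracted relations that are correct.
    recall:    fraction of benchmark relations that were found.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Tracking both numbers per iteration makes the trade-off explicit: tightening the prompt usually raises precision at some cost in recall, and the scientists decide when the balance is good enough to roll out.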

To accelerate this process, we have developed a structured approach that leads to proof-of-principle in less than three weeks, an LLM tuned towards biotech applications, our Unified Data Platform, and a scalable cloud-based delivery platform. Get in touch to learn more.


Book a discovery call

In our free 30-minute 1:1 discovery calls, a member of our team will be happy to answer your questions, share lessons learned and discuss how to address your specific needs.



Further reading

Drug discovery companies are customizing ChatGPT: here’s how (Neil Savage in Nature) discusses the various ways in which biotechs are putting large language models to use.

Generating ‘smarter’ biotechnology (Editorial in Nature Biotechnology) provides an overview of generative AI applications in life science, encompassing text, sequence and chemical structure data.

Can ChatGPT be used to advance drug discovery? (Willow Shah-Neville at Labiotech) focuses on use cases in drug discovery, pointing out limitations that can be overcome with our approach.

What Is ChatGPT Doing … and Why Does It Work? (Stephen Wolfram’s blog) provides a tour-de-force yet accessible introduction to the methodology behind LLMs, starting from scratch.
