Under the hood: 5 practical lessons from developing Large Language Model applications for drug discovery

Large Language Models stand out as a tangible example of generative artificial intelligence, and the biotech industry has been quick to explore potential use cases. Even so, practical real-world insights and lessons learned remain scarce. Here we share our experience and best practices from developing LLM-enabled applications for drug discovery clients.

In a recent survey by GlobalData, the majority of healthcare professionals pointed to artificial intelligence when asked which technology they would prioritise for investment. Indeed, despite a slow start, AI is now making notable inroads in the biotech and pharma sectors, with several companies currently testing “AI-developed” molecules in the clinic.

Despite growing adoption, it is still difficult to get a look under the hood of AI models employed in drug discovery or development. Thus, the role of AI in practice remains a proverbial black box, with limited exchange of concrete information on benefits, challenges, learnings and best practices.

At idalab, we specialise in helping our clients transform their drug discovery and development processes through the integration of AI and ML models, such as the revolutionary Large Language Model (LLM) technology.

In this article we share five important lessons we have learned while developing LLM applications for target discovery.

Ground the model in fact to avoid hallucinations

One key challenge when working with LLMs, such as ChatGPT, is that they tend to hallucinate. The massive volume of training data, combined with the model's inclination to be "helpful", causes it to respond to questions in any situation, often with vague or commonplace statements. When seeking information about drugs, targets, and diseases, such hallucinated or grey-zone information is of little benefit. It is comparable to asking for directions from someone who doesn’t know the way but politely points you somewhere anyway. As one quickly discovers, LLMs are also remarkably good at fabricating convincing-sounding but non-existent scientific references to back up their erroneous answers.

Therefore, in almost any application context, it is important to use a grounding strategy to avoid hallucinations.

How do we do this? The core idea is twofold. First, we constrain the model to base its “answers” only on reliable and relevant information that we supply, rather than generating output statistically from text ingested during training. Second, we limit and standardise the model's output, so that it only answers a question when it has a high level of confidence, instead of producing vague and difficult-to-interpret answers.
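To make this concrete, below is a minimal sketch of such a grounding prompt in Python, assuming the openai client (v1+) and a set of pre-retrieved literature passages; the retrieval step, prompt wording, and model name are illustrative, not our production setup.

```python
# Minimal grounding sketch: the model may only answer from supplied passages.
# Assumes the `openai` Python client (>= 1.0) and OPENAI_API_KEY in the
# environment; prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()

def grounded_answer(question: str, passages: list[str]) -> str:
    # Number the passages so the model can cite them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "Answer ONLY using the numbered passages provided. Cite passage "
        "numbers for every claim. If the passages do not contain enough "
        "evidence to answer, reply exactly: INSUFFICIENT EVIDENCE."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # we want faithful extraction, not creativity
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```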

Keep the human in the loop 

One of the doomsday scenarios conjured by AI sceptics is that it might render humans obsolete. In drug discovery, this is an unlikely outcome. Even a model that can read the scientific literature with the highest accuracy will not just spit out a list of perfect drug targets, as some die-hard AI enthusiasts might have us believe.

Especially when developing LLM applications for target discovery, the human component is vital for refining the model to the unique and complex questions that scientists explore. Human input is required not only to translate such research questions into a machine-digestible format but also to systematise researchers’ experiences and intuitions about relationships between targets, diseases, and drugs.

At idalab, we have found that the most significant factor influencing LLM performance is the way we interrogate the model and standardise its output, i.e. targeting and fine-tuning the prompt to the context at hand. In our experience, this goes as far as tailoring prompts to a specific biological context, such as trafficking of PD-L1 vs. its degradation (see the sketch below). This is why LLM-driven tool development in drug discovery needs multidisciplinary teams, with effective tools and workflows for collaboration.
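As a flavour of what such context-tailored prompting can look like, here is a simplified sketch; the contexts, wording, and output fields are illustrative rather than our actual production prompts.

```python
# Simplified sketch of context-specific prompt templates with a standardised
# output format; contexts, wording, and fields are illustrative.
OUTPUT_SCHEMA = (
    'Respond only as JSON: {"relevant": true/false, '
    '"evidence": "<verbatim quote from the text>", '
    '"confidence": "high"/"medium"/"low"}'
)

PROMPTS = {
    "trafficking": (
        "Does the following abstract describe a mechanism affecting the "
        "intracellular trafficking of {gene} (not its degradation)?\n"
        + OUTPUT_SCHEMA
    ),
    "degradation": (
        "Does the following abstract describe a mechanism affecting the "
        "degradation of {gene} (not its trafficking)?\n"
        + OUTPUT_SCHEMA
    ),
}

prompt = PROMPTS["trafficking"].format(gene="PD-L1")
```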

Test, test, test to perfection


Evaluating a model's performance is a crucial step to determine its utility as a decision-support tool and to inspire trust. When applying LLMs to extract relevant information from scientific literature, this entails assessing how often the model gets it wrong: misjudging the relevance of information it extracts from a paper, or overlooking genuinely relevant pieces of scientific literature.

When designing LLM applications for target discovery, numerous factors can influence performance. At idalab, we have created an internal gold standard test set to compare these components. After testing numerous models across different biological contexts, we found that GPT-4 seems on par with other state-of-the-art models, such as PaLM 2 or Llama 2. Indeed, the underlying LLM appears to matter less than the prompt design and output format. Moreover, setting up a systematic approach to managing variants of prompts may feel like overhead in the beginning, but is absolutely vital to move forward.

Our gold standard test set comprises manually curated scientific literature across different drug discovery contexts. This allows us to individually test and fine-tune our LLM tools with each iteration, ensuring consistent performance for the models we create for clients.
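As an illustration, scoring a model-plus-prompt variant against such a test set can be as simple as computing precision and recall over curated relevance labels; the sketch below uses made-up paper IDs and numbers.

```python
# Sketch of scoring one model/prompt variant against a curated gold standard.
# Paper IDs and values are made up for illustration.
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    hits = len(predicted & gold)  # papers correctly judged relevant
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

predicted = {"PMID:101", "PMID:102", "PMID:105"}  # model output
gold = {"PMID:101", "PMID:102", "PMID:103"}       # manual curation
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=0.67
```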

Tell the model what you already know

LLMs have revolutionised AI-driven text comprehension. Even so, several challenges remain that cannot be addressed by the LLM itself, e.g. retrieving the (relevant) text to feed into the model, and mapping terms in the text to the relevant ontologies.

Further, when crafting a strategy for targeting the model to specific research questions, we need to account for the ambiguities inherent in scientific literature, such as genes with multiple names, different orthologs and gene family members grouped under a single alias, or protein names that overlap with investigator names, assay descriptions, dates, and the like.

We address such ambiguities by enriching the model's input with contextual information from biological databases, such as gene name synonyms or orthologs in other species (see the sketch below).
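In practice, this can be as simple as injecting known aliases into the prompt; a minimal sketch, with a hard-coded synonym table standing in for a lookup against a resource such as HGNC or UniProt:

```python
# Minimal sketch: enrich the prompt with gene-name synonyms so the model
# recognises aliases. The hard-coded table stands in for a lookup against a
# resource such as HGNC or UniProt.
SYNONYMS = {
    "PD-L1": ["CD274", "B7-H1", "PDCD1LG1"],
}

def with_synonyms(question: str, gene: str) -> str:
    aliases = ", ".join(SYNONYMS.get(gene, []))
    return (
        f"{question}\n\nNote: '{gene}' may also appear in the text under the "
        f"aliases {aliases}; treat all of these as the same gene."
    )

print(with_synonyms("Does this abstract link PD-L1 to lysosomal degradation?", "PD-L1"))
```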

Integrating our LLM tools into a contextual framework helps to comprehensively gather information relevant to specific scientific queries, ultimately boosting model performance and efficiency.

Data privacy matters

An essential worry when integrating LLMs into business practices relates to data privacy. Indeed, information about a company's drug discovery strategy is highly sensitive, and biotech executives are understandably protective of it. Therefore, many biotech and pharma companies have restricted their teams’ access to searchable third-party databases, or replicated private versions in-house.

Of course, data privacy concerns are equally valid for any LLM-enabled software application. In the free version of ChatGPT, for instance, data privacy is explicitly not guaranteed.

However, for providers of generic foundational LLM technology services, such as Google or Microsoft, ensuring data privacy has already become a de facto industry standard. This is hardly surprising, as privacy is a must-have for most software applications; just think of file-sharing or messaging applications.

More specifically, providers of managed LLM cloud services typically guarantee contractually that neither prompts nor responses are retained, that client data is never used to train LLMs or improve other services, and that data is never shared with third parties. Should this not suffice, there is also the option of replicating a private instance in the cloud or even going self-hosted. However, before going down that path, one should carefully weigh the potential risk against the extra effort, also in light of the other cloud services that are in daily use.
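For illustration, a self-hosted deployment need not change the application code much: many local inference servers (e.g. vLLM) expose an OpenAI-compatible API, so the client simply points at a private endpoint. The URL and model name below are placeholders for your own deployment.

```python
# Sketch of calling a self-hosted, OpenAI-compatible endpoint (e.g. a Llama 2
# model served by vLLM). URL and model name are placeholders for your own
# private deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # private inference server
    api_key="unused-locally",             # many local servers ignore the key
)
response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[{"role": "user", "content": "Summarise the known regulators of PD-L1."}],
)
print(response.choices[0].message.content)
```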

Conclusion

Amidst a lot of unjustified AI hype, LLM technology stands out as a tangible breakthrough with many applications. In literature analysis, LLMs will bring about unprecedented depths of algorithmic text “understanding” that enable new levels of automation, and therefore speed and scale.

But many other exciting applications are already being developed, where LLMs are put to work on other types of biological data, e.g. to produce new proteins with specific properties, design antibodies, or discover disease-relevant information.

Even so, none of these applications are simple plug-and-play. Getting LLM applications to work in practice is an art and a science: understanding the biological and industry context is key to success.

At idalab, we have gained deep expertise in implementing, tailoring and tuning such LLM-based tools to aid our clients’ drug discovery efforts. We are always happy to swap best practices and provide a second (or first!) opinion. Just get in touch.


Book a discovery call

In our free 30-minute 1:1 discovery calls, a member of our team will be happy to answer your questions, share lessons learned and discuss how to address your specific needs.



Further reading

Drug discovery companies are customizing ChatGPT: here’s how (Neil Savage in Nature) discusses the various ways in which biotechs are putting Large Language Models to use.

Inside the nascent industry of AI-designed drugs (Carrie Arnold in Nature Medicine) discusses recent advances in AI-developed drugs in the biotech industry.

Engineering Biology: ML as Process Efficiency (Blog) discusses how to integrate AI/ML tools into scientific workflows.
