Open goals: will AlphaFold2’s ‘open science’ usher in a drug discovery revolution?
DeepMind has handed researchers the tools to implement AlphaFold2 and explore its ground-breaking potential for biology and drug discovery. But a tide of cheaper, more effective therapies is far from inevitable.
The performance of DeepMind's deep learning network, AlphaFold2, at the CASP-14 protein structure prediction challenge in November 2020 induced a deluge of excited headlines. They shared one sentiment: this technology had the potential to kickstart a revolution in biology.
Less than a year later, in July 2021, the DeepMind team followed up with two milestone events:
The AlphaFold2 Protein Structure Database, which details the structures of 350,000 proteins, including 98.5% of the ~20,000 known human proteins, was completed and made public. Released in partnership with the European Bioinformatics Institute (EMBL-EBI), the database also contains the structures of proteins from 20 other species, including important model organisms such as mouse and zebrafish, plus pathogens such as the malaria parasite. The number of structures compiled in the database is predicted to increase to over 100 million – covering almost the complete set of proteins in the UniProt database – by the end of 2021.
The DeepMind team published a description of their methodology in Nature; they also released AlphaFold2’s source code on GitHub, enabling researchers to validate, recreate and improve their methodology.
These resources hand researchers the means to test the limits of AlphaFold2’s immense potential. But is that enough? What else needs to happen for the technology to bring about a (scientific) revolution – and what will that mean for drug development?
AlphaFold2 in the real world
Which steps of the drug development pipeline could be forever changed by the structural knowledge of proteins it provides? “The technology’s contribution will be at the early stages of drug discovery, until lead generation,” says Michael Schauperl, Machine Learning Engineer at HotSpot Therapeutics. “Here it can certainly be of help, giving us structures of proteins that have not yet been experimentally derived…”
Certainly, the number of protein structures accessible to researchers has exploded with the publication of AlphaFold2’s results. “The AlphaFold2 Database is a useful resource for drug developers,” says Schauperl. “The source code will likely be more interesting for the structure prediction community, but not as much for drug developers … However, another level where AlphaFold2 can make a contribution is faster confirmation of experimental structures. Deriving those can take researchers months to years, and the process also relies on having a template or initial guess of the protein structure … Having the AlphaFold2 prediction as a starting point can considerably speed up this process.” There’s no doubt about it: AlphaFold2’s performance is truly astonishing, generating 350,000 protein predictions in only 48 hours. “But I wouldn't call it a revolution yet,” says Schauperl.
Still, there are obstacles to AlphaFold2’s usefulness for drug discovery. In its current state, the AlphaFold2 structure database’s information is reductionist, covering only one representative UniProt sequence per gene and only one possible conformation, and taking no account of the effects of ligand binding, protein complexes, mutations or post-translational modifications.
Moreover, AlphaFold2, like any other deep learning model, is only as good as the data it is trained on – the 170,000 experimentally derived protein structures available in the Protein Data Bank (PDB). This introduces a bias towards experimentally well-studied proteins and precludes the accurate prediction of protein structures for which there is little or no training data. Proteins that have intrinsically disordered regions, as well as structures that are determined by interaction with other proteins or other biomolecules such as DNA, fall into this group. Arguably, the latter would be of great importance for drug discovery, as they are often involved in disease-relevant signalling functions.
“To get better predictions on protein-protein interactions, you would need better input data, but it seems unlikely that researchers will start creating experimental data on protein structures just to train the model,” says Schauperl. He also believes the training bias could limit AlphaFold2’s usefulness in de novo protein design of, for example, antibody drugs: “The model is trained on stable proteins, and not good at recognising instability.”
Another challenge lies in AlphaFold2’s accuracy. The DeepMind team admits that only 36% of AlphaFold2’s predictions are currently precise enough on the atomic level to be useful for drug design, while it can confidently predict at least three quarters of the amino acid sequence for 44% of human proteins. How useful is this level of accuracy in the hands of drug discovery researchers?
“One of the good things about AlphaFold2 is that it gives you confidence levels for different parts of the protein – so you know which ones are trustworthy,” says Schauperl. “Generally, it works best on the domain level, but we have also used AlphaFold2 predictions to get ideas how domains might be connected to one another by binding loops. It could be quite useful. But we need to be cautious with these predictions nonetheless, as they are probably not as reliable as experimentally derived ones.” What additional features would he like to see from AlphaFold2? “Having structures with ligand binding would be really useful for us.”
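To make the idea of per-residue confidence concrete, here is a minimal Python sketch of how a researcher might check which parts of a predicted structure are trustworthy. It assumes the prediction is available as PDB-format text in which the per-residue confidence score (pLDDT, on a 0 to 100 scale) is stored in the B-factor column, as in the files distributed through the AlphaFold database. The function names and the cut-off of 70 are illustrative choices, not part of any AlphaFold tooling.

```python
from typing import Dict, Tuple


def read_plddt(pdb_text: str) -> Dict[Tuple[str, int], float]:
    """Return {(chain, residue number): pLDDT} from PDB-format text.

    Assumption: the confidence score sits in the B-factor field
    (columns 61-66 of each ATOM record). Only C-alpha atoms are read,
    so each residue contributes exactly one score.
    """
    scores: Dict[Tuple[str, int], float] = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def confident_fraction(scores: Dict[Tuple[str, int], float],
                       threshold: float = 70.0) -> float:
    """Fraction of residues whose score is at or above `threshold`.

    The default of 70 is an illustrative cut-off for "confident"
    regions; adjust it to the needs of the project.
    """
    if not scores:
        return 0.0
    confident = sum(1 for s in scores.values() if s >= threshold)
    return confident / len(scores)
```

Calling `confident_fraction(read_plddt(text))` on a downloaded prediction gives a quick sense of how much of the protein can be relied upon before any structure-based work begins. Low-scoring stretches, which may correspond to disordered regions, can be set aside or treated with extra caution.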
AlphaFold2 in biological context
In biology, context is everything. In the future, the deluge of new protein structures AlphaFold2 generates will be put into context by computational and experimental exercises. This functional knowledge has every chance of leading researchers down hitherto unexplored therapeutic avenues and adding to contextualized disease understanding.
At idalab we offer a unique protein scoring platform that integrates various levels of protein information, including AlphaFold-based structural information. It can be tailored to the specific needs of drug developers, helping them identify protein classes suited to their specific targeting approach, from classical small molecule inhibitors to covalent inhibitors, PROTACs or antibodies. Visit our website or book a call to learn more about our unique, integrated protein platform.
Further reading
The AlphaFold2 EBI database, which contains predictions of protein structures of nearly all human proteins, as well as protein structures of relevant model organisms
Publication of AlphaFold2 methodology and source code in Nature
Comment in Nature discussing the implications of publication of the AlphaFold2 database
Extensive blog post from ICR, reflecting on the historical implications of AlphaFold2 and its applicability for drug discovery