Open goals: will AlphaFold2’s ‘open science’ usher in a drug discovery revolution?

DeepMind has handed researchers the tools to implement AlphaFold2 and explore its ground-breaking potential for biology and drug discovery. But a tide of cheaper, more effective therapies is far from inevitable.

The performance of DeepMind’s deep learning network, AlphaFold2, at the CASP-14 protein structure prediction challenge in November 2020 prompted a deluge of excited headlines. They shared one sentiment: this technology had the potential to kickstart a revolution in biology [1].

Less than a year later, in July 2021, the DeepMind team followed up with two milestone events:

  1. The AlphaFold2 Protein Structure Database, which details the structures of 350,000 proteins, including 98.5% of the ~20,000 known human proteins, was made public. Released in partnership with the European Bioinformatics Institute (EMBL-EBI), the database also contains the structures of proteins from 20 other species, including important model organisms such as mouse and zebrafish, plus pathogens such as the malaria parasite. The number of structures compiled in the database is predicted to increase to over 100 million – covering almost the complete set of proteins in the UniProt database – by the end of 2021 [2,3].
  2. The DeepMind team published a description of their methodology in Nature; they also released AlphaFold2’s source code on GitHub, enabling researchers to validate, recreate and improve their methodology [4,5].

These resources hand researchers the means to test the limits of AlphaFold2’s immense potential [6]. But is that enough? What else needs to happen for the technology to bring about a (scientific) revolution – and what will that mean for drug development?

Lessons from the human genome project

The debate on what AlphaFold2 might mean for biology and drug development is reminiscent of the hype surrounding the so-called “completion” of the human genome project some 20 years ago. If the true power of scientific achievement can be measured in its ability to capture the imagination of the general public, then this monumental effort was right up there. “It will revolutionise the diagnosis, prevention, and treatment of most, if not all, human diseases,” predicted former US president Bill Clinton [7].

And yet such projects often fail to deliver the kind of seismic, immediate results the mainstream hopes for. The much-anticipated scientific big bang didn’t happen – but the knowledge created has contributed to biology in less tangible ways. Some of the more innovative areas of drug development, such as cancer immunotherapy, and gene and cell therapies – plus personalised medicines, which rely on genomic profiling of individual patients – would not have been possible without it.

Deriving experimental protein structures can take researchers months; AlphaFold2 dramatically speeds up the process

How does AlphaFold2 stack up next to this? Which steps of the drug development pipeline could be forever changed by the structural knowledge of proteins it provides? “The technology’s contribution will be at the early stages of drug discovery, until lead generation,” says Michael Schauperl, Machine Learning Engineer at HotSpot Therapeutics. “Here it can certainly be of help, giving us structures of proteins that have not yet been experimentally derived…” 

Certainly, the number of protein structures accessible to researchers has doubled with the publication of AlphaFold2’s results – while the number of unique proteins has increased by a factor of six [8,9]. “The AlphaFold2 Database is a useful resource for drug developers,” says Schauperl. “The source code will likely be more interesting for the structure prediction community, but not as much for drug developers … However, another level where AlphaFold2 can make a contribution is faster confirmation of experimental structures. Deriving those can take researchers months to years, and the process also relies on having a template or initial guess of the protein structure. Having the AlphaFold2 prediction as a starting point can considerably speed up this process.” There’s no doubt about it: AlphaFold2’s performance is truly astonishing, generating 350,000 protein predictions in only 48 hours [10]. “But I wouldn’t call it a revolution yet,” says Schauperl.

Derek Lowe’s blog posts on AlphaFold2 also sprinkle some water on the flames, pointing out that the real obstacles for drug development may lie outside the realm of protein structures: “Our failure rate in the clinic is around 90% overall, and none of those failures were due to lack of a good protein structure” [11]. Many problems in drug development – such as finding better targets, developing more reliable preclinical models, better toxicity prediction, and swifter clinical trials – can’t be solved with naked protein structures alone; a better contextual understanding is also required [10,11].

AlphaFold2 in the real world

In biology, context is everything. Knowledge of gene sequence or protein structure in a vacuum will only get you so far. However, just as the human genome project brought us indirect benefits, might AlphaFold2’s treasure trove of information have useful, if unforeseen, knock-on effects? How about linking protein structure and function, for example? Any assistance AlphaFold2 and similar tools could offer in this endeavour could bring about long-term improvements in disease understanding and drug discovery. 

There are obstacles, though. In its current state, the AlphaFold2 structure database’s information is reductionist: only one representative UniProt sequence per gene is covered, and only one conformation is predicted for each (even though multiple conformations may exist). It takes no account of the effects of ligand binding, protein complexes, mutations or post-translational modifications.

Moreover, while AlphaFold2’s performance on predicting certain protein structures is stunning, it has by no means solved the so-called protein-folding problem (closely related to Levinthal’s paradox) [12] – predicting the 3D structure that a protein will assume based purely on its amino acid sequence. Indeed, as is always the case with deep learning applications, AlphaFold2 is only as good as the data it is trained on – the 170,000 experimentally derived protein structures available in the Protein Data Bank (PDB). This induces a bias towards experimentally well-studied proteins and precludes the accurate prediction of protein structures for which there is little or no training data. Proteins that have intrinsically disordered regions, as well as structures that are determined by interaction with other proteins or other biomolecules such as DNA, fall into this group. Arguably, the latter would be of great importance for drug discovery, as they are often involved in disease-relevant signalling functions.

“To get better predictions on protein-protein interactions, you would need better input data, but it seems unlikely that researchers will start creating experimental data on protein structures just to train the model,” says Schauperl. He also believes the training bias could limit AlphaFold2’s usefulness in de novo protein design of, for example, antibody drugs: “The model is trained on stable proteins, and not good at recognising instability.”

Training bias could inhibit AlphaFold2’s application, says Michael Schauperl: ‘The model is trained on stable proteins, and not good at recognising instability’

What, then, is the added value of AlphaFold2’s predictions for drug developers? The DeepMind team admits that only 36% of AlphaFold2’s predictions are currently precise enough on the atomic level to be useful for drug design [5], while it can confidently predict at least three quarters of the amino acid sequence for 44% of human proteins [5,10]. How useful is this level of accuracy in the hands of drug discovery researchers?

“One of the good things about AlphaFold2 is that it gives you confidence levels for different parts of the protein – so you know which ones are trustworthy,” says Schauperl. “Generally, it works best on the domain level, but we have also used AlphaFold2 predictions to get ideas how domains might be connected to one another by binding loops. It could be quite useful. But we need to be cautious with these predictions nonetheless, as they are probably not as reliable as experimentally derived ones.” What additional features would he like to see from AlphaFold2? “Having structures with ligand binding would be really useful for us.”
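As a rough illustration of how those per-region confidence levels can be put to work, here is a minimal Python sketch. It assumes a PDB-format file from the AlphaFold database, in which the per-residue confidence score (pLDDT, on a 0-100 scale) is written into the B-factor column. The function names and the cut-off of 70 are illustrative assumptions, not part of any official tool.

```python
from typing import Dict, Iterable, List, Tuple


def residue_confidence(pdb_lines: Iterable[str]) -> Dict[int, float]:
    """Map residue number -> pLDDT, read from the B-factor column of CA atoms."""
    scores: Dict[int, float] = {}
    for line in pdb_lines:
        # Fixed-width PDB ATOM record (1-based columns): atom name 13-16,
        # residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confident_segments(
    scores: Dict[int, float], threshold: float = 70.0
) -> List[Tuple[int, int]]:
    """Return (start, end) runs of consecutive residues scoring >= threshold.

    The default threshold of 70 is an assumed cut-off for illustration.
    """
    segments: List[Tuple[int, int]] = []
    start = prev = None
    for res in sorted(scores):
        ok = scores[res] >= threshold
        if ok and start is not None and res == prev + 1:
            prev = res  # extend the current confident run
        else:
            if start is not None:
                segments.append((start, prev))  # close the previous run
            start, prev = (res, res) if ok else (None, None)
    if start is not None:
        segments.append((start, prev))
    return segments
```

Called on the lines of a downloaded structure file, `confident_segments(residue_confidence(lines))` would list the stretches of the chain (typically whole domains) where the prediction is more trustworthy, leaving out low-confidence loops and linkers.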

The open approach 

DeepMind’s approach to AlphaFold2 data sharing has been called a “perfect example of the virtuous circle of open science” by EMBL Director General Edith Heard [8]. Trained using publicly available data from the PDB, the algorithm’s outputs were made publicly available in the EBI-hosted database. On top of that, sharing the source code enables researchers to recreate and improve upon the technology.

Will this example of altruistic data sharing be the new normal in the space of artificial intelligence in pharma? It seems unlikely. For biotechs operating in the machine learning space, where business models often rely on selling licences, users tend to be presented with black box algorithms. Some commentators have even speculated that it was the imminent publication of the RoseTTAFold method in Science that motivated DeepMind to release the AlphaFold2 code [13,14].

Whatever the motives, the examples of AlphaFold2 and the human genome project show that with massive investments – initially financial, but also in the form of the hours of effort made possible by the open-science approach – large-scale biological problems can at least be addressed (if not solved). Plus, the knowledge accumulated along the way adds impressively to what the scientific community knows and, hence, to drug development. And therein lies the possibility of a genuine scientific revolution.

In the future, the deluge of new protein structures AlphaFold2 generates will be put into context by computational [9] and experimental exercises. This functional knowledge has every chance of leading researchers down hitherto unexplored therapeutic avenues. How fruitful these new paths might prove – and how long this might take – remains to be seen.

 

References:

  1. Short Nature article discussing AlphaFold2’s performance at CASP-14, featuring quotes from different researchers in the field on AlphaFold2’s potential impact
  2. Publication in Nature, describing the database
  3. The AlphaFold2 EBI database, which contains predictions of protein structures of nearly all human proteins, as well as protein structures of relevant model organisms
  4. Publication of AlphaFold2 methodology and source code in Nature
  5. DeepMind’s authors’ notes on AlphaFold2
  6. Comment in Nature discussing the implications of publication of the AlphaFold2 database
  7. Article from 2020 looking back on the human genome project and its ramifications
  8. Press release of the database launch from EBI and DeepMind
  9. Preprint of a community evaluation of AlphaFold2’s performance in different areas of protein prediction applications
  10. Extensive blog post from ICR, reflecting on the historical implications of AlphaFold2 and its applicability for drug discovery
  11. Science blog from Derek Lowe on AlphaFold2’s potential impact on drug discovery
  12. idalab article, discussing the performance of AlphaFold2 at CASP-14 and its potential impacts on drug development
  13. The Baker lab’s publication on RoseTTAFold in Science
  14. TechCrunch article, discussing AlphaFold2’s potential and drawing comparisons to the Baker lab’s RoseTTAFold algorithm

 
