The protein science revolution
Background
Protein research is well established. Elucidating protein sequences is straightforward now that gene-sequencing technology is mature, and experimental techniques such as x-ray crystallography and nuclear magnetic resonance spectroscopy have allowed researchers to solve the three-dimensional structures of proteins. Structural characterization, however, takes time. Even with the advent of robotics and increasingly sophisticated x-ray beams,3 it can take many months to establish a new protein structure. Naturally, researchers are interested in building tools and models that can predict protein structures, and a competition called CASP (Critical Assessment of Structure Prediction) runs every two years to assess the quality of those predictions.
AlphaFold
The first tremors of the protein revolution came at CASP 13 in 2018,4 when Google DeepMind’s AlphaFold entry substantially improved on previous attempts to predict protein structure, even for those proteins for which no known similar structures exist. In CASP 14, AlphaFold 2 made further substantial improvements, to the extent that some considered the prediction on par with experimental methods. AlphaFold utilizes deep-learning techniques to make structure predictions and incorporates information from known protein structures, as well as from protein sequences similar to the target sequence. AlphaFold’s source code was released as open source, allowing the protein design community to leverage the tool in their own work and build upon existing methods.
From structure to function
Determining the structure of proteins is important, but manipulating them requires further research. We knew the structures of tens of thousands of proteins before AlphaFold, so does knowing the structures of thousands more amount to mere scientific stamp-collecting? The key observation is that the same tools that help identify the structure of a protein could provide insights into its functions and how to modulate those functions, and even allow for designing new and useful proteins unseen in nature.
Large language models
Researchers began testing whether the same tools used to predict protein structures could be repurposed to design them.5 They found that structure-prediction tools could indeed be reverse-engineered to generate amino acid sequences. Others leveraged the idea that a sequence of amino acids is a bit like words in a sentence and used large language models akin to ChatGPT to make the link between amino acid sequence and the function of a protein.
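The sentence analogy can be made concrete with a minimal sketch. A language model does not read letters directly; it reads integer tokens. The Python below maps each of the 20 standard amino acids (one-letter codes) to a token id, which is roughly the first step in treating a protein like text. The vocabulary ordering, the function names `encode` and `decode`, and the example peptide are illustrative assumptions, not the tokenizer of any particular published model.

```python
# Illustrative only: a toy tokenizer, not the scheme used by any specific model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

# Assumed vocabulary: token id = position of the letter in the string above.
TOKEN_OF = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str) -> list[int]:
    """Convert a protein sequence into a list of integer token ids."""
    sequence = sequence.upper()
    unknown = sorted(set(sequence) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"non-standard residue codes: {unknown}")
    return [TOKEN_OF[aa] for aa in sequence]


def decode(tokens: list[int]) -> str:
    """Convert token ids back into the one-letter sequence."""
    return "".join(AMINO_ACIDS[t] for t in tokens)


if __name__ == "__main__":
    peptide = "MKTAYIAK"  # hypothetical example sequence
    tokens = encode(peptide)
    print(tokens)
    print(decode(tokens))
```

Once a sequence is a list of integers, it can in principle be fed to the same kind of model architecture that handles words, which is what makes the analogy useful in practice.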
Diffusion models
The most recent breakthroughs in protein science come from the same domain as image-generating models like DALL-E. These models learn to turn a noisy version of an image back into its original form, and then build on this by turning random noise into images representative of a text prompt. In a remarkable repurposing of this strategy, researchers have developed techniques that can generate novel protein structures by starting from random noise.6 The same techniques can be used to generate proteins of a user-specified shape, in much the same way that DALL-E can generate pictures based on user input.
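To give a feel for the "noise" half of this idea, here is a minimal sketch of how a structure can be progressively blurred into random noise. It covers only the noising direction; the generative direction needs a trained neural network and is not shown. The schedule endpoints, the function names `alpha_bar` and `add_noise`, and the three-point toy "backbone" are assumptions chosen for illustration, loosely following common image-diffusion setups rather than any specific protein-design tool.

```python
import math
import random

BETA_START, BETA_END = 1e-4, 0.02  # assumed endpoints of a linear noise schedule


def alpha_bar(t, steps):
    """Fraction of the original signal variance remaining after t of `steps` noising steps."""
    product = 1.0
    for s in range(t):
        beta = BETA_START + (BETA_END - BETA_START) * s / (steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(coords, t, steps, rng):
    """Jump straight to noise level t: shrink the signal and add Gaussian noise."""
    signal = alpha_bar(t, steps)
    keep, spread = math.sqrt(signal), math.sqrt(1.0 - signal)
    return [
        tuple(keep * c + spread * rng.gauss(0.0, 1.0) for c in point)
        for point in coords
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy coordinates standing in for a protein backbone.
    backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    for t in (0, 500, 1000):
        print(t, round(alpha_bar(t, 1000), 4), add_noise(backbone, t, 1000, rng))
```

At step 0 the coordinates are untouched, and by the final step almost none of the original signal remains. A generative model is trained to undo this process one step at a time, which is why it can later start from pure noise and arrive at a plausible structure.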
Implications for scientific and medical research
Protein science has progressed with breathtaking speed since AlphaFold burst onto the scene. Scientists now have access to tools that were barely imaginable just a few years ago, and they are demonstrating that the outputs from these methods accurately reflect reality. It is reasonable to assume that the coming months and years will be punctuated by further breakthroughs, given the novelty of the technology and broad interest from the scientific community.
Proteins as therapeutic targets
Emerging evidence shows that this new breed of protein-design techniques can create proteins that bind to other proteins.7 This is a key strategy for modulating protein function in biology, and tailoring this to specific needs may be a route to more precise and potent therapeutics. It is also possible to use predicted protein structures to elucidate potential drug designs.8,9
Defense against pandemics
Tentative reports tout the use of AI to understand the proteins associated with emerging viruses and to predict ways in which proteins in viruses such as SARS-CoV-2 could change to facilitate immune escape. These kinds of tools could become valuable weapons in preventing or mitigating future pandemics by accelerating efforts to combat viruses with effective drugs and vaccines.10
Elucidating new biological functions
The biological role of proteins is typically elucidated by techniques such as altering them and observing the impacts on the host organism or cells. Scientists can then drill down into why proteins play their roles: what they bind to and where they are found in the cell. Recent breakthroughs could move some of this work, such as finding proteins that bind to one another, out of the wet lab and into computers. A greater understanding of the roles of different proteins provides new insights into health and diseases, and new targets for therapeutic interventions.
Correcting protein defects
Understanding the implications of protein mutations or the ways in which proteins misfold can open avenues for new treatments. Knowing how cancer-causing mutations affect proteins can help with targeting rogue proteins and may form part of a personalized treatment strategy.
Better diagnostic tools
Natural proteins are incredibly effective at selectively sticking to things they are supposed to stick to. The human cell contains myriad things a given protein should not interact with, and relatively few that it should. Replicating this kind of specificity could, in principle, produce tests for any small molecules or biomarkers we want to detect. The interest around multi-cancer early detection tests for identifying abnormalities demonstrates the potential power of diagnostic tests; proteins can bind to a wider range of molecules than just DNA, and this could be the basis of future multi-disease tests that are powerful and non-invasive.
Reasons for caution
The protein science revolution is currently one of basic research, and we should be cautious about extrapolating breakthroughs into medical applications.
A good analogy for the protein science revolution of today may be the completion of the Human Genome Project in 2003. This was a monumental breakthrough in scientific research, which undoubtedly led to further breakthroughs that made a positive impact on human health. However, most mortality improvements since 2003 in advanced countries such as the U.K. have come from cardiovascular health, often largely attributed to statins rather than a better understanding of the genome. Another material driver of improvements has been reductions in lung cancer, a result of public health measures rather than a pharmacological breakthrough.
It is unclear whether a reductionist approach to understanding and manipulating biology is an efficient way to drive medical breakthroughs.11 Researchers and pharmaceutical companies have long used their knowledge of protein structures to try to design new drugs that will bind to important sites and modulate function. Despite some notable successes, only about 8% of drugs that enter clinical trials end up registered as new drugs.12 Of course, any increase in this efficiency would be welcome, but it is not clear how many newly registered drugs go on to materially increase the health span or lifespan of the population. In addition, promising results at the protein level might not be practical for clinical trials for many reasons; for example, therapeutics need to be synthesized, transported, and stored, and some proteins might not be amenable to this. Some promising therapeutics identified by computers will fail for these kinds of practical reasons, but we should acknowledge there have been AI-driven breakthroughs in this area too.13
Accessibility of treatments for the general public following any material breakthroughs must also be considered. Immunotherapies and personalized medicine are expensive, and some therapeutic strategies identified by new AI techniques may fall into the same category.
These concerns are not enough to dismiss the potential benefits promised by the protein science revolution; rather, they are a reminder to temper expectations for what it might deliver.