In a recent announcement, DeepMind claimed that it has accurately predicted the three-dimensional structures of almost all cataloged proteins known to science. That includes more than 200 million proteins found in practically every kind of organism, including people, animals, bacteria, and plants. Using an AI technique called deep learning, DeepMind’s AlphaFold model can predict the 3D structure of a protein just from its 1D amino acid sequence.
Proteins, which are composed of a ribbon of amino acids that folds up into a knot of intricate twists and turns, can be regarded as the fundamental building blocks of living beings. Because of the intrinsic flexibility of its amino acid components, a typical protein could in theory take on an estimated 10 to the power of 300 distinct forms. Every protein has its own folding configuration, and if that configuration is altered, the protein may misfold and cease to function. Hence, understanding protein folding is highly important.
Consider a locksmith designing a key for a lock. The locksmith needs to be familiar with the structural design of the lock to be able to make the key. Now imagine the locksmith has no information about the lock. They cannot design a key for a lock they know nothing about, and even if they manage to produce one, there is no way of knowing whether it will fit. Think of a medicine as the key and a folded protein as the lock, and you can see why researchers invest enormous time and effort in decoding the folded, 3D structure of a protein they’re working with, much like the locksmith would start their key-making quest by putting together the lock’s mold. When developing a cure, knowing the precise structure makes it much simpler to predict where and how a molecule will bind to a particular protein, as well as how that attachment may affect the protein’s folds.
It can take months in a lab to determine that fold and, subsequently, the function of the protein. Scientists have long relied on experimental techniques like X-ray crystallography and cryo-electron microscopy, which are expensive and time-consuming, and have experimented with automated prediction methods to simplify the procedure. However, until AlphaFold, no prediction method had come close to matching the precision of structures determined experimentally.
AlphaFold employs deep-learning neural networks trained on hundreds of thousands of experimentally confirmed protein structures and sequences in the PDB and other databases. When presented with a novel sequence, it initially searches databases for similar sequences that can reveal amino acids with a history of coevolving, indicating they are near in 3D space. Another method for estimating the distances between amino-acid pairs in the new sequence is to look at the structures of similar proteins that already exist.
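To make the coevolution idea concrete, here is a minimal sketch in Python. It is not AlphaFold’s method: AlphaFold learns such signals with a neural network, whereas this example uses mutual information, a classical covariation statistic, purely as an illustration. The function names and the toy alignment are hypothetical and assume the related sequences have already been aligned to equal length.

```python
import math
from collections import Counter


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    A high value means the residues at the two positions tend to change
    together across related sequences, which is one classical hint that
    the positions may sit close to each other in the folded protein.
    """
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: positions 0 and 1 always change together,
# position 2 never changes, and position 3 varies independently.
toy_msa = [
    "AKLE",
    "AKLQ",
    "DRLE",
    "DRLQ",
]

covarying = mutual_information(toy_msa, 0, 1)    # positions that coevolve
independent = mutual_information(toy_msa, 0, 3)  # positions that do not
```

In this toy alignment the covarying pair scores higher than the independent pair. Real alignments contain thousands of sequences and need corrections for shared ancestry and sampling noise, which this sketch ignores.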
As AlphaFold attempts to represent the 3D positions of amino acids, it iterates clues from these parallel tracks back and forth, continuously updating its estimate. It does this by using the “attention” concept to decide which amino-acid linkages are most relevant for its task at any particular time.
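The “attention” concept can be sketched in a few lines of Python. This is the generic scaled dot-product attention used across deep learning, not AlphaFold’s actual architecture, which is considerably more elaborate. The function names and the small vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention over lists of vectors.

    For each query, every key is scored by its dot product with the query,
    the scores are normalized with softmax, and the output is the
    weighted average of the value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical example: the query matches the first key far better than
# the second, so the output is dominated by the first value vector.
result = attention(
    queries=[[10.0, 0.0]],
    keys=[[10.0, 0.0], [0.0, 10.0]],
    values=[[1.0], [0.0]],
)
```

The weights play the role of relevance: items whose keys match the query closely contribute most to the result, which is the sense in which a model decides what to focus on at a given step.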
In December 2020, the second iteration of AlphaFold (AlphaFold2) made headlines when it won the Critical Assessment of Protein Structure Prediction (CASP) competition. The competition, which is held every two years, assesses progress on one of biology’s most difficult problems: determining proteins’ three-dimensional (3D) shapes from their amino-acid sequences alone. In this event, computer-generated predictions are compared with structures of the same proteins determined by experimental techniques such as X-ray crystallography or cryo-electron microscopy, which fire X-rays or electron beams at proteins to build up a picture of their form. AlphaFold2 won CASP14 by a large margin, predicting structures to atomic accuracy with a median error (RMSD_95) of less than 1 Angstrom, which is 3 times more accurate than the next best system and comparable to experimental methods. Further, the organizers of CASP acknowledged it as a solution to the 50-year-old “protein-folding problem.”
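For readers unfamiliar with the error measure, the sketch below shows a plain root-mean-square deviation (RMSD) between predicted and experimental atom positions. It is a simplification and an assumption on our part: it presumes the two structures are already superimposed and compares every atom, whereas the RMSD_95 figure quoted above covers 95 percent of residues and depends on an alignment step not shown here. The coordinates are invented for illustration.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, in the same units as the inputs.

    Assumes the structures are already superimposed; no rotation or
    translation is applied here.
    """
    if len(predicted) != len(experimental):
        raise ValueError("both structures must contain the same number of atoms")
    if not predicted:
        raise ValueError("at least one atom is required")

    total = 0.0
    for p, e in zip(predicted, experimental):
        total += sum((pc - ec) ** 2 for pc, ec in zip(p, e))
    return math.sqrt(total / len(predicted))


# Hypothetical coordinates in Angstroms: every predicted atom sits
# exactly 1 Angstrom away from its experimental counterpart.
predicted_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
error = rmsd(predicted_atoms, experimental_atoms)
```

A lower value means the predicted atoms lie closer to where experiments place them, so an error below 1 Angstrom is smaller than the typical distance between bonded atoms.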
The scientific landscape has changed significantly since AlphaFold’s formal launch in July last year, when DeepMind released predicted structures for about 350,000 proteins. To share this scientific information freely with the entire world, the Google subsidiary open-sourced AlphaFold and developed the AlphaFold Protein Structure Database (AlphaFold DB). According to DeepMind, the AlphaFold DB acts as a “google search” for protein structures, giving researchers quick access to predicted models of the proteins they’re researching. This allows them to concentrate their efforts and speed up experimental work. DeepMind stated that it had mapped 98.5 percent of the proteins used by the human body by the middle of 2021. It also predicted the entire ‘proteomes’ of 20 other widely studied organisms, such as mice and the bacterium Escherichia coli.
Scientists have made remarkable discoveries thanks to the AlphaFold Protein Structure Database, which has given users access to millions of protein structures. For instance, in April, Yale University researchers consulted AlphaFold’s database to help them achieve their objective of creating a brand-new, potent malaria vaccine. And in July of last year, researchers at the University of Portsmouth employed the method to develop enzymes that will tackle pollution caused by single-use plastics. DeepMind supported World Neglected Tropical Disease Day by developing structural predictions for organisms recognized by the World Health Organization as high-priority for research, thereby advancing the study of illnesses like leprosy and schistosomiasis, which affect more than one billion people worldwide. DeepMind also plans to assist the Drugs for Neglected Diseases Initiative in the coming years in identifying treatments for neglected yet widespread tropical diseases, including Chagas disease and leishmaniasis.
Additionally, DeepMind’s publicly accessible protein structures have been included in other openly accessible databases, including Ensembl, UniProt, and OpenTargets, where millions of people use them on a daily basis.
With the recent release of predicted structures for virtually all cataloged proteins known to science, in collaboration with EMBL’s European Bioinformatics Institute (EMBL-EBI), DeepMind has increased the AlphaFold DB’s size by more than 200x, from just under 1 million structures to more than 200 million structures. Researchers expect that this could significantly improve our knowledge of biology. With the inclusion of predicted structures for plants, bacteria, animals, and other organisms in this release, researchers now have a wealth of new opportunities to use AlphaFold to further their work on vital topics like sustainability, food insecurity, and neglected diseases.
The recent update also means that the majority of pages in UniProt’s primary protein database will have predicted structures. Additionally, all 200+ million structures will be available for bulk download via Google Cloud Public Datasets, offering scientists all across the world even greater access to AlphaFold.
While it seems like AlphaFold has achieved its biggest milestone, it has yet to overcome its own limitations before it can open up new research areas in drug discovery and the pharmaceutical industry. For instance, at present, AlphaFold cannot predict how proteins change shape when they come into contact with medicines or other compounds that interact with them. Meanwhile, researchers are exploring ways to modify its training dataset and code to enable functionality beyond its current output, which consists of a predicted position for each amino-acid unit of a protein and an associated confidence score.