Protein fuels the human body's growth and development, and every cell in the body contains proteins in their many forms. Biologists and researchers have studied the structure and function of proteins intensively since the 1960s, when Christian Anfinsen's experiments on the thermodynamic hypothesis showed that a protein chain spontaneously arranges itself into a specific physical shape called the folded protein structure. Whether a protein is folded or unfolded plays an important role in how it functions. Determining folded protein structures was out of reach for decades because the technology and the understanding of protein structures were insufficient, but AI-powered protein folding models are now changing that.
What is protein folding?
Protein folding is a fast and reproducible process in which a protein chain adopts its distinct native three-dimensional structure, the form in which the protein becomes biologically active. This three-dimensional structure is usually a folded conformation held together by molecular interactions, and it is determined by the sequence of amino acids in the protein. Reaching the correct native structure is essential because only that specific structure functions normally in living organisms; misfolding produces inactive or toxic proteins that cause malfunction and disease. Predicting protein structures therefore plays a vital role in preventing disease. AI has driven significant advances in protein folding models, where a deep learning model predicts a protein's structure from biological information collected in the past. AI-based protein folding models are advancing toward predicting the structures of proteins whose structures have never been determined.
Stages of protein folding
Protein folding is a complex process involving four hierarchical levels of protein arrangement, from primary to quaternary. Because the variation in amino acid sequences is enormous, proteins can adopt many different conformations.
- Primary: The primary structure of a protein is its linear sequence of amino acids, held together by peptide bonds.
- Secondary: Secondary structure is the stage where protein folding begins, forming either 𝞪-helices (alpha-helices) or 𝞫-sheets (beta-sheets). 𝞪-helices form when the protein backbone coils into a spiral, and 𝞫-sheets form when the backbone bends back on itself into a sheet. These folds occur rapidly and are stabilized by intramolecular hydrogen bonds: electrostatic attractions between the amide hydrogen and the carbonyl oxygen of the peptide bonds in the protein.
- Tertiary: Tertiary structure is the folding stage where the secondary structures are connected. The 𝞪-helices and 𝞫-sheets are amphipathic, meaning each has a hydrophilic side and a hydrophobic side, and this property drives formation of the tertiary structure: folding proceeds so that hydrophilic sides face the aqueous environment while hydrophobic sides turn away from the water. Once the structure forms and is stabilized by hydrophobic interactions, covalent disulfide bridges may also form. A tertiary structure usually contains a single polypeptide chain; when additional polypeptide chains interact, the structure becomes quaternary.
- Quaternary: Quaternary structure arises when multiple polypeptide chains, each folded into its tertiary structure, interact. These chains assemble or coassemble as subunits into the fully functional quaternary protein.
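The hydrophobic/hydrophilic split that drives tertiary folding can be sketched in a few lines. The grouping of nonpolar residues below is a rough textbook convention chosen for illustration, not the classification used by any particular folding model:

```python
# Toy sketch: label residues of a primary sequence as hydrophobic ('H')
# or polar ('P') -- the property that steers tertiary folding, with
# hydrophobic residues burying themselves away from water.

HYDROPHOBIC = set("AVLIMFWYC")  # approximate nonpolar grouping (an assumption)

def hydrophobicity_pattern(sequence: str) -> str:
    """Return 'H' for hydrophobic residues and 'P' for polar ones."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence)

print(hydrophobicity_pattern("MKVLAD"))  # -> "HPHHHP"
```

Runs of `H` in such a pattern hint at stretches likely to end up in the protein's hydrophobic core.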
Read more: Researchers use Deep Learning to Hallucinate synthesis of new proteins
Top five protein folding models
This list covers five leading protein folding models. Note that the list is not a ranking.
AlphaFold 2, an AI model developed by Google's DeepMind, is a deep learning approach that incorporates physical and biological knowledge about protein structure along with multiple sequence alignments (MSAs). The model won the 14th Critical Assessment of protein Structure Prediction (CASP14), held in November 2020, earning its place among the best protein folding models. CASP is a community competition in which algorithms predict the three-dimensional structure of proteins, and AlphaFold 2 performed remarkably at CASP14, delivering far more accurate and reliable results than competing models. The earlier version, AlphaFold 1, had already built a strong reputation, finishing first at CASP13 in 2018.
The main difference between the two versions lies in how their modules are trained. AlphaFold 1 trained its modules independently, using gradient descent to fit a statistical potential that estimates the probability distribution of the local free energy of a configuration.
AlphaFold 2 instead couples its sub-networks into a single end-to-end model based on pattern recognition. Its key component is a transformer design that progressively refines a vector of information for each relationship or bond: between one amino acid residue of the protein and another, or between each amino acid position and each sequence in the alignment.
This refinement relies on an attention mechanism, meaning the model focuses on the most relevant data by gathering it together and filtering out what is unnecessary. In October 2021, DeepMind released an update to AlphaFold 2, called AlphaFold-Multimer, which includes protein complexes in its training data and reports a success rate of about 70% in predicting protein-protein interactions.
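The attention idea behind this refinement can be sketched with plain scaled dot-product attention: every residue representation scores its relevance to every other and mixes in the most relevant ones. The shapes and values below are toy choices for illustration, not AlphaFold 2's actual architecture:

```python
import numpy as np

# Minimal scaled dot-product attention sketch: rows attend to all other
# rows, so relevant residue pairs receive higher weight.

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))    # 5 toy "residues", 8-dim features
out = attention(x, x, x)       # self-attention over the residues
print(out.shape)               # (5, 8): one refined vector per residue
```

Stacking and iterating layers like this, with learned projections for Q, K, and V, is what lets a transformer progressively refine the pairwise representations.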
RoseTTAFold is a deep learning software tool for predicting protein structures, developed by Minkyung Baek, Ph.D., at the Baker lab. It can offer insight into a protein's function even when no structure has been determined, and it quickly generates accurate models of protein-protein complexes. RoseTTAFold is built on a three-track neural network that simultaneously integrates and processes one-dimensional protein sequence information and two-dimensional information about the distances between amino acids.
This design lets the network directly learn the patterns relating a peptide sequence to its folded architecture. Because information flows back and forth among the 1-D, 2-D, and 3-D representations, the network can generate accurate protein models from sequence information alone. As reported in Science, RoseTTAFold has predicted hundreds of new protein structures, including many poorly characterized proteins. It is also believed to have the potential to resolve difficult modeling problems in X-ray crystallography and cryo-electron microscopy. The RoseTTAFold ecosystem aims at accurate protein structure prediction, progress in identifying protein function, and focusing future efforts where they will be most productive.
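The "2-D" information in the second track is essentially a pairwise distance map. A minimal sketch, using made-up coordinates for three residues, shows how such a map is derived from 3-D positions:

```python
import numpy as np

# Sketch: compute a pairwise distance map from 3-D residue coordinates.
# The coordinates below are invented for illustration (units: angstroms).

coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0]])

# distance_map[i, j] = Euclidean distance between residues i and j
diff = coords[:, None, :] - coords[None, :, :]   # broadcast pairwise differences
distance_map = np.sqrt((diff ** 2).sum(axis=-1))

print(distance_map.round(1))   # symmetric matrix with zeros on the diagonal
```

Models like RoseTTAFold predict and refine maps of this kind, then recover 3-D coordinates consistent with them.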
ESMFold is a high-accuracy, end-to-end, atomic-level protein structure prediction model developed by Meta AI Research. It uses ESM-2, a transformer-based language model that updates the evolutionary scale modeling (ESM) family and learns the interactions between amino acids in a protein sequence. Built on a 15-billion-parameter transformer, ESMFold predicts structures faster than other protein folding models while achieving high accuracy. Most models, including AlphaFold 2 and RoseTTAFold, rely on a multiple sequence alignment approach to make predictions; ESMFold instead leverages a large-scale language model for protein prediction.
In this approach, ESMFold generates a structure prediction from a single input sequence, leveraging the internal representations of the language model. Meta evaluated ESMFold on the Continuous Automated Model EvaluatiOn (CAMEO) and CASP14 test sets and compared it with AlphaFold 2 and RoseTTAFold. ESMFold's template modeling scores (TM-scores) were 83 on CAMEO and 68 on CASP14; by comparison, AlphaFold 2 scored 88 and 84, and RoseTTAFold 82 and 81. Since the CASP14 result was less promising, the researchers pointed to the perplexity of the underlying language model as a limiting factor. ESMFold has not yet been open-sourced like AlphaFold 2 and RoseTTAFold, but hopefully it will be in the future.
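The TM-score used in these comparisons has a standard definition (Zhang and Skolnick): after superposing prediction and reference, each aligned residue contributes according to its displacement, scaled by a length-dependent factor. A simplified sketch, with invented distances and without the search over superpositions that real evaluation performs:

```python
# TM-score sketch: d_i are per-residue distances (angstroms) between the
# predicted and reference structures after superposition; L is the target
# length. The distances used below are illustrative, not real data.

def tm_score(distances, L):
    d0 = 1.24 * (L - 15) ** (1 / 3) - 1.8    # length-dependent scale
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / L

# A perfect prediction (all distances zero) scores exactly 1.0.
print(tm_score([0.0] * 100, 100))   # 1.0
```

Scores above roughly 0.5 generally indicate the same overall fold, which is why the 0.8-range scores reported above count as accurate predictions (the article quotes them multiplied by 100).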
Read more: How Is Software Development A New Benchmark?
D-I-TASSER is a distance-guided iterative threading assembly refinement model derived from the I-TASSER method developed by the Zhang lab. It is a high-accuracy protein structure and function prediction model that integrates threading with deep learning. D-I-TASSER starts from a query sequence, then generates inter-residue contact maps, distance maps, and hydrogen-bond networks using multiple deep neural network predictors, including AttentionPotential (a self-attention network built on MSA transformers) and DeepPotential. Integrating the hydrogen-bonding restraints into the structural assembly simulations was found to significantly improve the model's accuracy on CASP14 targets. Large-scale tests showed D-I-TASSER to be more accurate than I-TASSER, including for sequences that have no homologous templates in the Protein Data Bank.
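A contact map of the kind these predictors generate can be sketched by thresholding a distance map; 8 Å is a common convention for "in contact." The distances below are toy values:

```python
import numpy as np

# Sketch: derive an inter-residue contact map from a distance map by
# thresholding. Distance values here are invented for illustration.

distance_map = np.array([[0.0, 3.8, 9.5],
                         [3.8, 0.0, 6.1],
                         [9.5, 6.1, 0.0]])

CUTOFF = 8.0                              # angstroms, a common convention
contact_map = distance_map < CUTOFF
np.fill_diagonal(contact_map, False)      # a residue is not its own contact

print(contact_map.astype(int))            # 1 = in contact, 0 = not
```

Restraints like these (contacts, distances, hydrogen bonds) are what guide the assembly simulations toward the native fold.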
OmegaFold is a high-resolution de novo structure prediction model that works from a primary sequence, launched in July 2022 by HeliXon, a Chinese biotech firm. Unlike AlphaFold 2, RoseTTAFold, and other models that preprocess multiple sequence alignments, it works on single, even divergent, sequences. In the OmegaFold study, the researchers describe a new protein language model that enables predictions from single sequences, combined with a geometry-inspired transformer model trained on protein structures.
This study narrows the gap between protein structure prediction and understanding how proteins fold in nature. OmegaFold is reported to be ten times faster than RoseTTAFold and AlphaFold 2, outperforming RoseTTAFold and reaching accuracy comparable to AlphaFold 2. The model can predict a protein's structure from a single amino-acid sequence without relying on known structures as templates. As part of OmegaFold, the HeliXon team introduced OmegaPLM, a deep transformer-based protein language model (PLM) that captures the structural and functional information encoded in amino-acid sequences through embeddings. These embeddings are fed into Geoformer, a geometry-based transformer neural network that processes the structural and physical pairwise relationships between amino acids. Finally, a structure module predicts the output 3-D coordinates of the heavy atoms that form the folded protein structure.
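The three-stage pipeline just described (OmegaPLM embeddings, Geoformer pairwise reasoning, structure module) can be sketched as a simple data flow. Every function here is a tiny placeholder standing in for a large neural network; the names echo the paper's components, but the bodies and shapes are invented for illustration and are not HeliXon's actual API:

```python
# Data-flow sketch of the OmegaFold pipeline. All function bodies are
# placeholders -- each stands in for a deep network described in the text.

def omega_plm(sequence):
    """OmegaPLM stand-in: one embedding per residue from a single sequence."""
    return [[float(i)] for i, _ in enumerate(sequence)]

def geoformer(embeddings):
    """Geoformer stand-in: pairwise relationships between residues."""
    n = len(embeddings)
    return [[abs(embeddings[i][0] - embeddings[j][0]) for j in range(n)]
            for i in range(n)]

def structure_module(pairwise):
    """Structure-module stand-in: one 3-D coordinate triple per residue."""
    return [(float(i), 0.0, 0.0) for i in range(len(pairwise))]

coords = structure_module(geoformer(omega_plm("MKVLAD")))
print(len(coords))   # 6: one coordinate triple per residue
```

The point of the sketch is the single-sequence input: unlike MSA-based models, nothing upstream of `omega_plm` requires homologous sequences or templates.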