Researchers from the Skolkovo Institute of Science and Technology (Skoltech), Lomonosov Moscow State University, and the Syntelly startup have created and trained a neural network that generates names for organic compounds using the IUPAC nomenclature system.
In their research, published in Scientific Reports, a Nature Portfolio journal, the team describes a Transformer-based artificial neural network approach for translating between SMILES and IUPAC chemical notations, consisting of two models: Struct2IUPAC and IUPAC2Struct. Struct2IUPAC converts SMILES strings to IUPAC names, and IUPAC2Struct performs the reverse conversion.
IUPAC, the International Union of Pure and Applied Chemistry, was founded in 1919 to harmonize the chemical naming of elements and organic compounds. For instance, in IUPAC terms, sucrose is called (2R,3R,4S,5S,6R)-2-[(2S,3S,4S,5R)-3,4-dihydroxy-2,5-bis(hydroxymethyl)oxolan-2-yl]oxy-6-(hydroxymethyl)oxane-3,4,5-triol, and paracetamol, the active ingredient of antipyretic drugs like Tylenol, is N-(4-hydroxyphenyl)acetamide. Because an IUPAC name spells out a molecule's structure using numbers and long chains of name fragments, it is hard to remember and easy to get wrong. Omitting even a single digit or symbol is unacceptable in the scientific domain.
SMILES, or Simplified Molecular Input Line Entry System, was created to make chemical information processing easier for both humans and computers. For example, ethanol is written as CCO, which represents the molecule's fundamental backbone without any hydrogens: a carbon bonded to a carbon bonded to an oxygen. One notable feature of SMILES is that many different strings can describe the same molecule. For ethanol, OCC and C(O)C are both acceptable.
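To make the point about equivalent SMILES strings concrete, here is a minimal Python sketch. It is not a real SMILES parser and is not taken from the study: it assumes a tiny subset of the notation (only the atoms C, N and O, single bonds, parenthesized branches, no rings), and the helper names `parse_smiles` and `canonical_key` are hypothetical. It builds the bond graph for each string and reduces it to an order-independent key, so strings that describe the same molecule compare equal.

```python
def parse_smiles(smiles):
    """Parse a tiny SMILES subset: atoms C/N/O, single bonds, branches in ()."""
    atoms = []   # element symbol of each atom, in order of appearance
    bonds = []   # pairs of atom indices joined by a single bond
    stack = []   # atoms to return to when a branch closes
    prev = None  # atom the next atom will be bonded to
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in "CNO":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported character in this sketch: {ch!r}")
    return atoms, bonds


def canonical_key(smiles):
    """Return a key that is identical for every spelling of the same acyclic molecule."""
    atoms, bonds = parse_smiles(smiles)
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def encode(node, parent):
        # Sorting the child encodings removes any dependence on writing order.
        children = sorted(encode(n, node) for n in neighbors[node] if n != parent)
        return atoms[node] + "(" + "".join(children) + ")"

    # Try every atom as the root and keep the smallest encoding.
    return min(encode(root, None) for root in range(len(atoms)))


# The three spellings of ethanol from the text share one key;
# dimethyl ether (COC) has the same atoms but a different key.
print(canonical_key("CCO") == canonical_key("OCC") == canonical_key("C(O)C"))
print(canonical_key("CCO") == canonical_key("COC"))
```

Production cheminformatics toolkits solve the same problem with full canonicalization algorithms that also handle rings, charges, aromaticity and stereochemistry, which this sketch deliberately leaves out.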
According to Skoltech research scientist Sergey Sosnin, the team initially wanted to create an IUPAC name generator for Syntelly. However, they soon realized that writing an algorithm by digitizing the IUPAC rules would take more than a year. They therefore decided to leverage their expertise in neural network solutions instead. Sosnin is also the lead author of the study and a co-founder of the Syntelly startup.
The team used the standard Transformer architecture with six encoder and decoder layers and eight attention heads as the basis for their research. The encoder creates an encoded representation of the words in the input data (a latent vector or context vector). Given that representation, the decoder builds the target sequence one step at a time, predicting the most likely next word at each step. The Transformer also uses an attention mechanism that looks at an input sequence and decides at each step which other parts of the sequence are important. This lets the network focus selectively on certain parts of its input and thus reason more effectively.
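The attention step described above can be illustrated with a minimal sketch of scaled dot-product attention, the basic operation of the standard Transformer. This is an illustration only, not the authors' code: it uses a single head with no learned projection weights, whereas the model in the study has eight heads, and the toy query, key and value vectors are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score each key by its similarity to the query, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data (assumed): two input positions and one query resembling the first.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])  # most of the weight falls on the first position
print(outputs[0])  # so the output is dominated by the first value vector
```

In a full Transformer the queries, keys and values come from learned linear projections of the token embeddings, and several such heads run in parallel in every layer.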
The team trained Struct2IUPAC to convert a molecule's structural representation into an IUPAC name and IUPAC2Struct to do the reverse. They used PubChem, the world's biggest open chemical library with over 100 million organic chemicals, as the basis for training and testing the new networks. Within six weeks of being designed, Struct2IUPAC learned to generate names with about 98.9% accuracy (1,075 mistakes per 100,000 molecules) on a subset of 100,000 random organic molecules from the test set.
In recent years, the use of neural network approaches for solving chemistry problems has grown in popularity. By treating molecules and reactions like words and sentences, researchers have found ways to get computers to understand chemical compounds. Yet, despite its enormous scope, the technology is still in its infancy.
Sosnin says, “We have shown that neural networks can cope with exact problems, disproving the formerly prevalent notion that they should not be used for this kind of problem. Replacing a word with a synonym is possible in machine translation, whereas a single wrong symbol results in an incorrect molecule in our task. Yet, Transformer successfully copes with this task.”
The new solution has been integrated into the Syntelly platform and is available online. The researchers anticipate that their technique will be useful for converting between chemical notations as well as other technical notation-related activities like formula synthesis and software translation.