Sharing chemical knowledge between human and machine
Structural formulae show how chemical compounds are constructed, i.e., which atoms they consist of, how these are arranged spatially and how they are connected. Chemists can deduce from a structural formula, among other things, which molecules can react with each other and which cannot, how complex compounds can be synthesised or which natural substances could have a therapeutic effect because they fit together with target molecules in cells.
Developed in the 19th century, the representation of molecules as structural formulae has stood the test of time and is still used in every chemistry textbook. But what makes the chemical world intuitively comprehensible for humans is, for software, just a collection of black-and-white pixels. “To make the information from structural formulae usable in databases that can be searched automatically, they have to be translated into a machine-readable code,” explains Christoph Steinbeck, Professor of Analytical Chemistry, Cheminformatics and Chemometrics at the University of Jena.
An image becomes a code
And that is precisely what can be done using the Artificial Intelligence tool “DECIMER,” developed by the team led by Prof. Steinbeck and his colleague Prof. Achim Zielesny from the Westphalian University of Applied Sciences. DECIMER stands for “Deep Learning for Chemical Image Recognition.” It is an open-source platform that is freely available to everyone on the Internet and can be used in a standard web browser. Scientific articles containing chemical structural formulae can be uploaded there simply by dragging and dropping, and the AI tool will immediately get to work.
“First, the entire document is searched for images,” explains Steinbeck. The algorithm then classifies each image it finds as either a chemical structural formula or some other kind of image. Finally, the recognised structural formulae are translated into a machine-readable structure code or displayed in a structure editor, so that they can be processed further. “This step is the core of the project and the real achievement,” adds Steinbeck.
In this way, the chemical structural formula for the caffeine molecule becomes the machine-readable structure code CN1C=NC2=C1C(=O)N(C(=O)N2C)C. This can then be uploaded directly into a database and linked to further information on the molecule.
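To illustrate why such a code is easier for software to handle than an image, here is a minimal Python sketch that counts the non-hydrogen atoms in the caffeine string above. The string follows the widely used SMILES line notation. The function name `count_heavy_atoms` and the simple pattern matching are assumptions made for illustration only: this is not a full SMILES parser, it is not part of DECIMER, and it ignores bracketed and lowercase (aromatic) atoms.

```python
import re
from collections import Counter

# Atom symbols that SMILES allows without brackets (the "organic subset").
# Two-letter symbols come first so that "Cl" is not read as "C".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a simple, bracket-free SMILES string.

    Illustrative helper only: bracket atoms such as [Na+] and lowercase
    aromatic atoms such as 'c' are not handled.
    """
    return Counter(ATOM_PATTERN.findall(smiles))


caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
counts = count_heavy_atoms(caffeine)
print(dict(counts))  # {'C': 8, 'N': 4, 'O': 2}
```

The result matches the heavy atoms in caffeine's molecular formula, C8H10N4O2; hydrogen atoms are implicit in this notation and therefore do not appear in the string.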
To develop DECIMER, the researchers used modern AI methods that have only recently become established and that are also used, for example, in the large language models (such as ChatGPT) that are currently the subject of much discussion. To train its AI tool, the team generated structural formulae from existing machine-readable databases and used them as training data: some 450 million structural formulae to date. In addition to researchers, companies are already using the AI tool, for example to transfer structural formulae from patent specifications into databases.