Where once were black boxes, new LANTERN illuminates
Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only could it help with the difficult job of altering proteins in practically useful ways, but it also works by methods that are fully interpretable, an advantage over the conventional artificial intelligence (AI) that has aided protein engineering in the past.
The new tool, called LANTERN, could prove useful in work ranging from producing biofuels to improving crops to developing new disease treatments. Proteins, as building blocks of biology, are a key element in all these tasks. But while it is comparatively easy to make changes to the strand of DNA that serves as the blueprint for a given protein, it remains challenging to determine which specific base pairs — rungs on the DNA ladder — are the keys to producing a desired effect. Finding these keys has been the purview of AI built of deep neural networks (DNNs), which, though effective, are notoriously opaque to human understanding.
Described in a new paper published in the Proceedings of the National Academy of Sciences, LANTERN shows the ability to predict the genetic edits needed to create useful differences in three different proteins. One is the spike-shaped protein from the surface of the SARS-CoV-2 virus that causes COVID-19; understanding how changes in the DNA can alter this spike protein might help epidemiologists predict the future of the pandemic. The other two are well-known lab workhorses: the LacI protein from the E. coli bacterium and the green fluorescent protein (GFP) used as a marker in biology experiments. Selecting these three subjects allowed the NIST team to show not only that their tool works, but also that its results are interpretable, an important characteristic for industry, which needs predictive methods that also help researchers understand the underlying system.
“We have an approach that is fully interpretable and that also has no loss in predictive power,” said Peter Tonner, a statistician and computational biologist at NIST and LANTERN’s main developer. “There’s a widespread assumption that if you want one of those things you can’t have the other. We’ve shown that sometimes, you can have both.”
The problem the NIST team is tackling might be imagined as interacting with a complex machine that sports a vast control panel filled with thousands of unlabeled switches: The device is a gene, a strand of DNA that encodes a protein; the switches are base pairs on the strand. The switches all affect the device’s output somehow. If your job is to make the machine work differently in a specific way, which switches should you flip?
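To get a rough sense of how quickly the choices on such a control panel multiply, the short Python sketch below counts the distinct sets of switches that could be flipped together. It is only an illustration: the panel size of 1,000 is an assumed figure, not one taken from the NIST work, and each switch is treated as a simple on/off toggle, whereas a real DNA base can change to any of three alternatives. The function name is likewise hypothetical.

```python
from math import comb


def count_switch_combinations(n_switches: int, k_flipped: int) -> int:
    """Return the number of distinct sets of k switches chosen from n.

    Order is ignored: flipping switches (3, 7) is the same as (7, 3).
    """
    return comb(n_switches, k_flipped)


if __name__ == "__main__":
    n = 1000  # hypothetical panel size, chosen only for illustration
    for k in (1, 2, 3, 4):
        total = count_switch_combinations(n, k)
        print(f"{k} switch(es) flipped at once: {total:,} possible sets")
```

Even under these simplified assumptions, moving from two switches to four takes the count from roughly half a million to tens of billions, far more than could be tested one experiment at a time.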
Because the answer might require changes to multiple base pairs, scientists have to flip some combination of them, measure the result, then choose a new combination and measure again. The number of permutations is daunting. More