'Off label' use of imaging databases could lead to bias in AI algorithms
Significant advances in artificial intelligence (AI) over the past decade have relied upon extensive training of algorithms using massive, open-source databases. But when such datasets are used “off label” and applied in unintended ways, the results are subject to machine learning bias that compromises the integrity of the AI algorithm, according to a new study by researchers at the University of California, Berkeley, and the University of Texas at Austin.
The findings, published this week in the Proceedings of the National Academy of Sciences, highlight the problems that arise when data published for one task are used to train algorithms for a different one.
The researchers noticed this issue when they failed to replicate the promising results of a medical imaging study. “After several months of work, we realized that the image data used in the paper had been preprocessed,” said study principal investigator Michael Lustig, UC Berkeley professor of electrical engineering and computer sciences. “We wanted to raise awareness of the problem so researchers can be more careful and publish results that are more realistic.”
The proliferation of free online databases over the years has helped support the development of AI algorithms in medical imaging. For magnetic resonance imaging (MRI), in particular, improvements in algorithms can translate into faster scanning. Obtaining an MR image involves first acquiring raw measurements that encode a representation of the image. Image reconstruction algorithms then decode the measurements to produce the images that clinicians use for diagnostics.
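To make the encode-and-decode idea concrete, here is a minimal sketch in Python with NumPy. It is not from the study. It assumes the standard simplification that MRI measurements are samples of the image's two-dimensional Fourier transform (often called k-space), and it models faster scanning by skipping rows of measurements. The function names, the random stand-in image, and the sampling pattern are all hypothetical choices made for illustration.

```python
import numpy as np

def encode(image):
    """Simulate acquisition: map an image to Fourier-domain measurements."""
    return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(image), norm="ortho"))

def decode(kspace):
    """Simplest possible reconstruction: an inverse Fourier transform."""
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace), norm="ortho"))

def undersample(kspace, keep_every=2, center_lines=8):
    """Keep every `keep_every`-th row plus a few central rows (assumed pattern)."""
    mask = np.zeros(kspace.shape[0], dtype=bool)
    mask[::keep_every] = True
    mid = kspace.shape[0] // 2
    mask[mid - center_lines // 2 : mid + center_lines // 2] = True
    # Rows that were never "scanned" are set to zero.
    return kspace * mask[:, None], mask

rng = np.random.default_rng(0)
image = rng.random((64, 64))            # stand-in for a real MR image

kspace = encode(image)                  # fully sampled measurements
full_recon = decode(kspace)             # decoding recovers the image

sub_kspace, mask = undersample(kspace)  # fewer measurements, shorter scan
zero_filled = decode(sub_kspace)        # naive decoding of the partial data

full_error = np.abs(full_recon - image).max()
sub_error = np.abs(zero_filled - image).max()
```

With every measurement available, the inverse transform recovers the image up to floating-point rounding. With rows missing, the naive reconstruction degrades, and that gap is what learned reconstruction algorithms are trained to close.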
Some datasets, such as the well-known ImageNet, include millions of images. Datasets that include medical images can be used to train the AI algorithms that decode the measurements obtained in a scan. Study lead author Efrat Shimron, a postdoctoral researcher in Lustig’s lab, said new and inexperienced AI researchers may be unaware that the files in these medical databases are often preprocessed, not raw.
As many digital photographers know, raw image files contain more data than their compressed counterparts, so training AI algorithms on databases of raw MRI measurements is important. But such databases are scarce, so software developers sometimes download databases with processed MR images, synthesize seemingly raw measurements from them, and then use those to develop their image reconstruction algorithms.
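The gap between raw and synthesized measurements can be illustrated with a short Python sketch. It is not the researchers' experiment, and it rests on labeled assumptions: measurements are modeled as a two-dimensional Fourier transform, the underlying image is complex-valued with an invented linear phase, and the only "processing" applied is keeping the magnitude. The processing steps actually found in public databases may differ.

```python
import numpy as np

def conjugate_symmetry_error(kspace):
    """Relative deviation from conjugate symmetry, K[-k] == conj(K[k])."""
    # Flip and roll so that position k holds the value originally at -k.
    mirrored = np.roll(np.flip(kspace, axis=(0, 1)), shift=1, axis=(0, 1))
    return np.abs(kspace - np.conj(mirrored)).max() / np.abs(kspace).max()

rng = np.random.default_rng(0)
n = 64
magnitude = rng.random((n, n))                  # stand-in anatomy
phase = 2 * np.pi * 3 * np.arange(n) / n        # assumed phase, illustration only
complex_image = magnitude * np.exp(1j * phase)[None, :]

# What a scanner would record under this simplified model.
raw_kspace = np.fft.fft2(complex_image)

# A processed database image: magnitude only, phase discarded.
processed_image = np.abs(complex_image)

# "Seemingly raw" measurements synthesized from the processed image.
synthesized_kspace = np.fft.fft2(processed_image)

raw_asymmetry = conjugate_symmetry_error(raw_kspace)
synth_asymmetry = conjugate_symmetry_error(synthesized_kspace)
```

In this toy setup the synthesized measurements are almost perfectly conjugate-symmetric, because they come from a real-valued image, while the simulated raw measurements are not. An algorithm developed on the synthesized data therefore sees structure that genuinely raw data would lack, which is one way results can look better than they would in practice.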