The number of AI and, in particular, machine learning (ML) publications related to medical imaging has increased dramatically in recent years. A PubMed search using the MeSH keywords “artificial intelligence” and “radiology” yielded 5,369 papers in 2021, more than five times the number found in 2011. ML models are constantly being developed to improve healthcare efficiency and outcomes, for tasks ranging from classification to semantic segmentation, object detection, and image generation. Numerous published reports in diagnostic radiology, for example, indicate that ML models can perform as well as or even better than medical experts in specific tasks, such as anomaly detection and pathology screening.
It is thus undeniable that, when used correctly, AI can assist radiologists and drastically reduce their workload. Despite the growing interest in developing ML models for medical imaging, significant challenges can limit such models’ practical applications or even predispose them to substantial bias. Data scarcity and data imbalance are two of these challenges. On the one hand, medical imaging datasets are frequently much smaller than natural photograph datasets such as ImageNet, and pooling institutional datasets or making them public may be impossible due to patient privacy concerns. On the other hand, even the medical imaging datasets that data scientists do have access to are often imbalanced.
In other words, the volume of medical imaging data for patients with specific pathologies is significantly lower than for patients with common pathologies or healthy people. Using insufficiently large or imbalanced datasets to train or evaluate an ML model may result in systemic biases in model performance. Several strategies address these problems, including the public release of deidentified medical imaging datasets and the adoption of approaches such as federated learning, which enables ML model development on multi-institutional datasets without data sharing. Alongside these, synthetic image generation is one of the primary strategies for combating data scarcity and data imbalance.