Before a machine-learning model can complete a task, such as identifying cancer in medical images, the model must be trained. Training image classification models typically involves showing the model millions of example images gathered into a massive dataset.
However, using real image data can raise practical and ethical concerns: The images could run afoul of copyright laws, violate people’s privacy, or be biased against a certain racial or ethnic group. To avoid these pitfalls, researchers can use image generation programs to create synthetic data for model training. But these techniques are limited because expert knowledge is often needed to hand-design an image generation program that can create effective training data.
Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere took a different approach. Instead of designing customized image generation programs for a particular training task, they gathered a dataset of 21,000 publicly available programs from the internet. Then they used this large collection of basic image generation programs to train a computer vision model.