Machine learning algorithms are fueled by data. Gathering relevant data is the most crucial and challenging step in creating a robust machine learning model that can successfully perform tasks like image classification. Unfortunately, just because data is becoming more abundant does not mean everyone can use it. Collecting diverse real-world data is complex, error-prone, and time-consuming, and it can cost millions of dollars. As a result, reliable results are often out of reach because there is a dearth of credible training data for machine learning algorithms. This is where synthetic data comes to the rescue!
Synthetic data is created by a computer using 3D models of environments, objects, and humans to quickly produce varied clips of specific behaviors. It is becoming increasingly useful because it comes without the copyright constraints or ethical ambiguity that often accompany real data. It fills in the gaps when real data is scarce or when existing image data fails to capture the nuances of the physical world.
By bridging the gap between reality and its representation, synthetic data helps prevent machine learning models from making errors that a person would never make. However, there is a significant bottleneck: synthesis starts out simple but becomes more difficult as the demands on the quantity and quality of the image data increase. Developing an image generation system that yields useful training data requires expert domain knowledge.
To address these issues, MIT researchers from the MIT-IBM Watson AI Lab collected a dataset of 21,000 publicly accessible programs from the internet rather than building custom image-generating algorithms for a specific training task. These programs generate a wide range of graphics with simple color and texture patterns. This includes procedural models, statistical image models, models based on the architecture of GANs, feature visualizations, and dead leaves image models. Each program consists of only a few lines of code, and the researchers did not edit or modify any of them. They then trained a computer vision model on this extensive collection of basic image-generating programs.
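To give a sense of how little code such a program needs, here is a minimal sketch in Python. It is not one of the 21,000 collected programs; the function name `sine_texture`, its parameters, and the choice of overlapping sine waves are assumptions made purely for illustration.

```python
import math
import random


def sine_texture(seed, size=32):
    """Hypothetical image program: return a size x size RGB image.

    The image is a nested list of (r, g, b) tuples with values in [0, 1].
    """
    rng = random.Random(seed)
    # One random (x-frequency, y-frequency, phase) triple per color channel,
    # so every seed produces a different color and texture pattern.
    params = [
        (rng.uniform(1.0, 8.0), rng.uniform(1.0, 8.0), rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(3)
    ]
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            u, v = x / size, y / size
            pixel = tuple(
                0.5 + 0.5 * math.sin(2.0 * math.pi * (fx * u + fy * v) + phase)
                for fx, fy, phase in params
            )
            row.append(pixel)
        image.append(row)
    return image


image = sine_texture(seed=0)
print(len(image), len(image[0]), len(image[0][0]))  # 32 32 3
```

Changing the seed changes the frequencies and phases, so a single short program of this kind can emit an unlimited stream of distinct images.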
According to the researchers, excellent synthetic data for training vision systems has two essential characteristics: naturalism and diversity. It’s interesting to note that the most naturalistic data is not necessarily the best because naturalism might compromise diversity. The goal is synthetic data that is naturalistic in the sense that it captures key structural aspects of real data, while still remaining diverse.
The researchers did not need to generate images in advance to train the model because these simple programs run so efficiently. Instead, they found that they could generate images and train the model at the same time, which sped up the process.
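Generating images while training can be sketched with a Python generator that produces fresh batches on demand, so no image is ever written to disk. Everything here is an assumption for illustration: the two `stripes` programs, the handcrafted `features`, and the tiny logistic regression stand in for the real image programs and vision model, which the article does not describe in code.

```python
import math
import random


def stripes(rng, size, vertical):
    """Hypothetical image program: grayscale sine stripes, random frequency and phase."""
    freq = rng.uniform(1.0, 3.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [
        [0.5 + 0.5 * math.sin(2.0 * math.pi * freq * (x if vertical else y) / size + phase)
         for x in range(size)]
        for y in range(size)
    ]


def features(image):
    """Mean absolute intensity change along x and along y."""
    size = len(image)
    dx = sum(abs(image[y][x + 1] - image[y][x]) for y in range(size) for x in range(size - 1))
    dy = sum(abs(image[y + 1][x] - image[y][x]) for y in range(size - 1) for x in range(size))
    n = size * (size - 1)
    return [dx / n, dy / n]


def stream_batches(seed, batch_size=16, size=16):
    """Yield freshly generated (features, label) batches forever; nothing is stored."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            label = rng.randint(0, 1)  # label = which program produced the image
            batch.append((features(stripes(rng, size, vertical=bool(label))), label))
        yield batch


def train(steps=200, lr=2.0, seed=0):
    """Logistic regression trained by mini-batch SGD on images generated during training."""
    w, b = [0.0, 0.0], 0.0
    batches = stream_batches(seed)
    for _ in range(steps):
        batch = next(batches)  # images are created only now, when the step needs them
        gw, gb = [0.0, 0.0], 0.0
        for x, y in batch:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            gw[0] += (p - y) * x[0]
            gw[1] += (p - y) * x[1]
            gb += p - y
        w = [w[i] - lr * gw[i] / len(batch) for i in range(2)]
        b -= lr * gb / len(batch)
    return w, b


def accuracy(w, b, seed=123, n_batches=10):
    """Evaluate on newly generated batches from a different seed."""
    correct = total = 0
    batches = stream_batches(seed)
    for _ in range(n_batches):
        for x, y in next(batches):
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            correct += pred == y
            total += 1
    return correct / total


w, b = train()
print("accuracy on fresh images:", accuracy(w, b))
```

The design point is that `stream_batches` is lazy: each training step pulls exactly one batch, so generation cost is paid only as the model consumes data and the dataset never has to be materialized.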
The researchers pre-trained computer vision models for both supervised and unsupervised image classification tasks using their enormous dataset of image-generating programs. In supervised learning the image data are labeled, whereas in unsupervised learning the model learns to categorize images without labels.
Compared to previous synthetically trained models, the models they trained with this large dataset of programs classified images more accurately. The researchers also demonstrated that adding more image programs to the dataset improved model performance, suggesting a straightforward way to increase accuracy further.
The accuracy levels were still inferior to those of models trained on actual data, but their method reduced the performance difference between models trained on real data and those trained on synthetic data by an impressive 38%.
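One natural reading of the 38% figure is as a relative reduction of the accuracy gap to real-data training, compared with an earlier synthetic-data baseline. The short sketch below works through that arithmetic; the function name `gap_reduction` and the three accuracy values are hypothetical and were chosen only to make the calculation easy to follow, not taken from the research.

```python
def gap_reduction(real_acc, baseline_synth_acc, new_synth_acc):
    """Fraction of the real-vs-synthetic accuracy gap closed by the new method."""
    old_gap = real_acc - baseline_synth_acc
    new_gap = real_acc - new_synth_acc
    return (old_gap - new_gap) / old_gap


# Hypothetical accuracies in percent (illustrative only, not reported results).
reduction = gap_reduction(real_acc=80.0, baseline_synth_acc=60.0, new_synth_acc=67.6)
print(f"{reduction:.0%}")  # 38%
```

With these made-up numbers the gap shrinks from 20 points to 12.4 points, which is a 38% reduction even though the synthetic model still trails the real-data model.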
The researchers also used each image-generating program individually for pretraining in order to identify the factors that influence model accuracy. They discovered that a model performs better when a program generates a more varied set of images. They also discovered that the images that improve model performance the most are colorful ones with scenes that fill the entire canvas.
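The three properties mentioned here (variety, color, and canvas coverage) can be approximated with simple numeric proxies. The sketch below is only an illustration of what such measurements might look like; the functions `mean_saturation`, `canvas_coverage`, and `diversity`, and the assumption of a black background, are hypothetical and are not the measures the researchers used.

```python
import colorsys
from itertools import combinations


def mean_saturation(image):
    """Average HSV saturation over all pixels (a rough proxy for colorfulness)."""
    pixels = [px for row in image for px in row]
    return sum(colorsys.rgb_to_hsv(*px)[1] for px in pixels) / len(pixels)


def canvas_coverage(image, background=(0.0, 0.0, 0.0), tol=1e-6):
    """Fraction of pixels that differ from an assumed background color."""
    pixels = [px for row in image for px in row]
    filled = sum(any(abs(c - b) > tol for c, b in zip(px, background)) for px in pixels)
    return filled / len(pixels)


def diversity(images):
    """Mean pairwise per-value absolute difference across a set of images."""
    def distance(a, b):
        flat_a = [c for row in a for px in row for c in px]
        flat_b = [c for row in b for px in row for c in px]
        return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)

    pairs = list(combinations(images, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Two tiny 2x2 RGB images: one fully red, one with half the canvas left black.
red = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
half = [[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]

print(mean_saturation(red), mean_saturation(half))  # 1.0 0.5
print(canvas_coverage(red), canvas_coverage(half))  # 1.0 0.5
print(diversity([red, half]))
```

Scores like these could be computed per program to compare programs against each other, though any real analysis would need measures validated against downstream accuracy.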
With this research, the team emphasizes that their findings raise questions about the real complexity of the computer vision problem: if very short programs can generate the data needed to train a high-performing computer vision system, then creating such a model may be simpler than previously thought and might not require enormous data-driven systems to achieve adequate performance. Further, their methods make it possible to train image classification systems when image datasets are not accessible, thereby addressing the cost, bias, privacy, and ethical concerns of data collection. The researchers clarified that they are not advocating for completely eliminating datasets from computer vision (since real data may be needed for evaluation), but rather evaluating what can be done in the absence of data.