FEderated LearnIng with a CentralIzed Adversary (FELICIA), a federated generative mechanism that enables collaborative learning, was recently proposed by researchers from Microsoft and the University of British Columbia to train models on private medical data.
What is the problem?
AI researchers have long called for easier access to medical data from varied sources to better train medical diagnosis models for tasks such as disease detection and biomedical image segmentation. Images from a single source are biased by that source's demographics, medical equipment types, and acquisition process, so a model trained on them skews toward the source population and performs poorly for other populations.
Medical data owners, such as hospitals and research centers, therefore share their medical images to gain access to differently sourced data and cut their data curation costs. They mostly use the additional data to counter the bias arising from their own limited datasets while keeping their source data private. But legal constraints complicate access to large external medical datasets: current legislation prevents datasets from being shared or processed outside their source in order to avoid privacy breaches. By reducing the diversity of data available for diagnostics, the very laws that safeguard patients' privacy can endanger their lives through less powerful AI models.
What is the solution? Why are GANs involved?
To set the data imbalance right, the researchers generate synthetic medical data using Generative Adversarial Network (GAN) architectures. A GAN pits two neural networks, the adversaries, against each other. A generator produces fake data that looks as real as possible; its output is mixed with real data and fed to a discriminator, which tries to tell the fake samples from the real ones. In this zero-sum game, each network pushes the other to improve, and the result is a generator that produces fake data ever closer to the real distribution.
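The adversarial objective can be made concrete with the standard GAN losses. The sketch below (NumPy, with hypothetical prediction arrays rather than real networks) shows the zero-sum structure: the discriminator is rewarded for scoring real samples near 1 and fakes near 0, while the generator is rewarded when the discriminator scores its fakes near 1.

```python
import numpy as np

def bce(preds, targets):
    """Binary cross-entropy, the standard GAN training objective."""
    eps = 1e-7  # clip to avoid log(0)
    preds = np.clip(preds, eps, 1 - eps)
    return float(-np.mean(targets * np.log(preds)
                          + (1 - targets) * np.log(1 - preds)))

def discriminator_loss(d_real, d_fake):
    # D wants real samples scored as 1 and generated samples as 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # G wants D to mistake its fakes for real data (score near 1).
    return bce(d_fake, np.ones_like(d_fake))
```

A confident, correct discriminator yields a small discriminator loss; a generator that fools the discriminator yields a small generator loss. Training alternates gradient steps on these two objectives.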
What does the best solution look like?
The best solution builds upon the PrivGAN architecture, which works locally on a single dataset and generates synthetic images; another group of researchers later showed that PrivGAN can also be used in a federated learning setting. PrivGAN was designed to protect against membership inference attacks, which exploit noticeable patterns in a model's outputs to reveal whether a record was in its training data. This robustness against training-data leakage makes PrivGAN the natural candidate for Microsoft's FELICIA, which must honor medical data privacy constraints.
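One simple form of membership inference against a generative model, shown here as an illustrative sketch rather than the paper's attack, is a distance test: if the generator emits a sample unusually close to a candidate record, that record was likely memorized from the training set.

```python
import numpy as np

def membership_inference(candidates, generated, threshold):
    """Flag candidate records that sit suspiciously close to generator output.

    A generator that memorizes training data tends to emit near-copies of
    training records, so a small nearest-neighbor distance is evidence of
    membership. `threshold` is an attacker-chosen value (hypothetical here).
    """
    flags = []
    for x in candidates:
        nearest = np.min(np.linalg.norm(generated - x, axis=1))
        flags.append(bool(nearest < threshold))
    return flags
```

In practice an attacker would calibrate the threshold on reference data; PrivGAN's training penalty discourages exactly the memorization this kind of attack exploits.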
What is the best way to implement the solution?
Microsoft's FELICIA extends any GAN to a federated learning setting using a centralized adversary: a central discriminator with limited access to the shared data. How much data is shared with the central discriminator depends on many factors, such as the use case, regulation, business-value protection, and infrastructure. To test the mechanism, the researchers used multiple copies of the same discriminator and generator architectures of a 'base GAN' inside FELICIA. The central privacy discriminator (D_P) is kept identical to the other discriminators except for the final layer's activation. First, the base GANs are trained individually on the whole training set to generate realistic images. Then FELICIA's parameters, jointly optimized from the base GANs' parameters, are tuned to produce realistic synthetic samples.
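The wiring described above can be sketched schematically. This assumes, as in PrivGAN, that the central privacy discriminator D_P classifies which user's generator produced a sample, so its final activation is a softmax over users rather than the local discriminators' sigmoid; the factory functions are placeholders, not the paper's actual architectures.

```python
def build_felicia(num_users, make_generator, make_discriminator):
    """Assemble FELICIA's components from a base GAN's architecture.

    Each user receives an identical copy of the base generator and
    discriminator; one extra central privacy discriminator D_P shares
    the discriminator architecture but swaps the final activation.
    """
    users = [
        {"generator": make_generator(),
         "discriminator": make_discriminator(final_activation="sigmoid")}
        for _ in range(num_users)
    ]
    # D_P: identical architecture, different final activation
    # (assumed softmax over users, following PrivGAN).
    d_p = make_discriminator(final_activation="softmax")
    return users, d_p

# Toy usage with string stand-ins for real networks:
users, d_p = build_felicia(
    num_users=3,
    make_generator=lambda: "G",
    make_discriminator=lambda final_activation: f"D/{final_activation}",
)
```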
FELICIA's federated loss function optimizes, in equal measure, the local utility on each user's data and the global utility on all users' data. Successive synthetic images therefore have to improve at both the local and the global level. The hyperparameter λ balances each user's participation in the global loss optimization and, unlike the original PrivGAN loss, improves utility.
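The role of λ can be shown with a toy version of the federated objective, using scalar stand-ins for the actual GAN loss terms rather than the paper's exact formula: each user optimizes its local term plus λ times the global term judged on all users' data.

```python
def felicia_user_loss(local_loss, global_loss, lam):
    # lam = 0 recovers purely local training; larger lam weights
    # the shared global objective more heavily.
    return local_loss + lam * global_loss

def felicia_federated_loss(local_losses, global_losses, lam):
    # Average the per-user objectives across all participants.
    per_user = [felicia_user_loss(l, g, lam)
                for l, g in zip(local_losses, global_losses)]
    return sum(per_user) / len(per_user)
```

With two users whose local losses are 1.0 and 3.0 and global losses 2.0 and 4.0, λ = 0.5 gives a federated loss of 3.5, while λ = 0 reduces to the plain average of the local losses.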
Did Microsoft’s FELICIA work?
Yes. FELICIA's images are clearer and more diverse than those of other GANs, and it generates synthetic images with more utility than is possible with access to local images alone. The improved utility suggests that the samples cover more of the input space than those of the local GANs.
Across multiple experiments, combining FELICIA's synthetic images with real data achieved performance on par with real data, and most results significantly improved utility even in the worst cases. The improvement is particularly significant when the data is most biased: the more biased a dataset, the more its synthetic data gains in utility. Excluding the first 10,000 epochs, a FELICIA-augmented dataset is almost always better than what is achieved with real images alone.
These results show that Microsoft's FELICIA lets the owner of a rich medical image set help the owner of a small, biased image set improve its utility without either party ever sharing an image. Different data owners (e.g., hospitals) can now help each other by creating joint or disjoint synthetic datasets that contain more utility than any single dataset alone. Such a synthetic dataset could be shared freely within the local hospital while the real images stay secured and accessible to only a limited number of individuals. This arrangement produces powerful models trained on shared data across research groups while maintaining confidentiality: a data owner can generate high-quality, high-utility synthetic images while providing no access to its own data.