Data privacy has become a serious topic of discussion at AI conferences, tech-company board meetings, and government events. While it may seem elusive to achieve, researchers have proposed a number of technologies that can bring us closer to this goal. One promising solution is federated learning, where computation is moved closer to the data source rather than to the cloud.
Federated learning is a technique that allows users to collectively reap the benefits of shared models trained on rich data without the need to store that data centrally, according to Google’s research paper Communication-Efficient Learning of Deep Networks from Decentralized Data. As stated above, federated learning brings the model to the data source (generally edge nodes) rather than bringing the data to the model. It connects multiple computational devices into a decentralized system in which individual data-collecting devices assist in model training. Devices thus collaboratively learn a shared prediction model while all training data remains on the individual device. By eliminating the need to move large amounts of data to a central server for training, federated learning directly addresses concerns about data privacy.
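The training loop described above can be sketched with federated averaging, the algorithm introduced in the Google paper cited here. The following is a minimal illustration, assuming a toy linear-regression model, made-up client datasets, and an arbitrary learning rate; only model parameters, never data, reach the aggregator.

```python
# Minimal sketch of federated averaging (FedAvg) with NumPy.
# Client data, model shape, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def local_step(weights, X, y, lr=0.1):
    """One local gradient step of linear regression; data never leaves the client."""
    preds = X @ weights
    grad = X.T @ (preds - y) / len(y)
    return weights - lr * grad

# Each client holds its own private dataset (same features, different samples).
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]

global_weights = np.zeros(3)
for _ in range(5):
    # Clients train locally starting from the current global model...
    local_updates = [local_step(global_weights, X, y) for X, y in clients]
    # ...and only the resulting model parameters are averaged centrally.
    global_weights = np.mean(local_updates, axis=0)
```

In a real deployment each round would also sample a subset of clients and run multiple local epochs, but the core idea is the same: the server sees parameter updates, not training data.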
Artificial intelligence and machine learning models are evidently data-hungry, and to build efficient, personalized AI-powered solutions, we need data of both high quality and sufficient quantity. Unfortunately, most of the data organizations require lies scattered across different companies, user devices, and nations. Although integrating all of these data sources sounds like an alluringly easy fix, it is impractical due to constraints such as security, user privacy, and governmental regulations. Federated learning is a lucrative alternative that complies with privacy laws: since data never leaves its original location, diverse data owners can collaborate at the organizational level without exchanging raw data. Further, federated learning has the potential to make organizations less reliant on individual data monopolies while simultaneously generating more revenue from their own data via ‘protected sharing.’
There are two main categories of federated learning, along with a third variant that extends them.
Horizontal federated learning is used when data sources share the same feature space but differ in their samples. Two regional banks, for example, may serve quite different user groups from their respective regions, with only a small intersection of users. Because their businesses are so similar, however, their feature spaces are nearly identical.
On the other hand, when two datasets share the same sample ID space but differ in feature space, vertical federated learning can be used. For instance, consider two separate companies in the same city: one a bank and the other an e-commerce business. Because both user sets are likely to include the majority of the area’s residents, their user spaces intersect widely. However, because the bank records users’ expenditure patterns and credit scores while the e-commerce store records users’ browsing and purchase histories, their feature spaces are distinct. To better understand vertical federated learning, consider that Client A (Zomato) has information about a customer’s food item purchases on Zomato, while Client B (Dineout) has that customer’s restaurant reviews in Surat. Combining these datasets from different domains can better serve customers, for example by using restaurant review information (Dineout) to provide better food recommendations to customers browsing food items on Zomato.
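The vertical data layout described above is easy to picture with a small example. The sketch below uses hypothetical user IDs and feature names; it only illustrates how the two parties' records line up by sample ID, while in a real deployment the raw features would never leave their owners.

```python
# Illustrative sketch of the data layout in vertical federated learning:
# both parties hold (mostly) the SAME users but DIFFERENT features.
bank = {          # party A: financial features, keyed by user ID
    "u1": {"credit_score": 710, "monthly_spend": 1200},
    "u2": {"credit_score": 640, "monthly_spend": 800},
    "u3": {"credit_score": 590, "monthly_spend": 300},
}
shop = {          # party B: behavioural features for overlapping users
    "u1": {"page_views": 44, "purchases": 3},
    "u2": {"page_views": 10, "purchases": 1},
    "u4": {"page_views": 75, "purchases": 6},
}

# Training can only use the users both parties know: the sample-ID intersection.
shared_ids = sorted(bank.keys() & shop.keys())   # ['u1', 'u2']

# Conceptually, each aligned training row spans both feature spaces.
aligned = {uid: {**bank[uid], **shop[uid]} for uid in shared_ids}
```

Note that each party contributes different columns for the same rows, whereas in the horizontal setting each party would contribute different rows with the same columns.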
There is also federated transfer learning, whose architecture extends vertical federated learning. Here, the data of the various participants overlap very little in both the feature and sample dimensions; in other words, datasets on separate clients have distinct sample spaces as well as distinct feature spaces. Federated transfer learning can be used to train a custom model, such as movie recommendations, based on a user’s previous web browsing activity.
In both scenarios, the data owners can interact without jeopardizing the privacy of their individual clients. Vertical federated learning in particular holds great application promise: partnering enterprises that possess data from the same set of users but with disjoint features can jointly train models without disclosing their private data. As training data volume and model size grow, businesses can build a cluster of multiple servers to participate in the federation. Unlike horizontal federated learning, vertical federated learning enables a deeper understanding of each data field.
However, too many network transfers inside each organization’s cluster can considerably degrade the overall performance of a vertical federated learning operation.
Since vertical federated learning requires identifying the data items held by all parties in order to prepare the training data, it relies on Private Set Intersection (PSI). PSI detects the intersection of training samples across all parties by using personally identifiable information (e.g., email addresses) as sample IDs to align data instances. As a result, PSI makes the intersection sample IDs public to all parties, allowing each party to verify whether the data entities appearing in the intersection are also held by the others, i.e., intersection membership. However, industries such as healthcare may not favor making information about their user base public. To solve this, researchers presented a framework based on Private Set Union (PSU) in a study published in 2021 that allows each participant to keep sensitive membership information to itself. Rather than determining the intersection of all training samples, the PSU approach creates training instances from the union of the samples.
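To make the role of PSI concrete, the toy sketch below aligns samples by blinded email IDs. This is only an illustration of the alignment step, not a secure PSI protocol: hashing with a shared salt can be brute-forced over small ID spaces, and real systems use cryptographic PSI (e.g., Diffie-Hellman- or oblivious-transfer-based schemes). The emails and salt are made up.

```python
# Toy illustration of sample-ID alignment, the job PSI performs.
# NOTE: salted hashing is NOT a secure PSI protocol; see lead-in caveat.
import hashlib

def blind(ids, salt=b"shared-secret-salt"):
    """Map each ID to a blinded token the parties can exchange."""
    return {hashlib.sha256(salt + i.encode()).hexdigest(): i for i in ids}

party_a = ["ann@example.com", "bob@example.com", "eve@example.com"]
party_b = ["bob@example.com", "eve@example.com", "zoe@example.com"]

blind_a, blind_b = blind(party_a), blind(party_b)
# The parties exchange only the blinded tokens and intersect them.
common = blind_a.keys() & blind_b.keys()
intersection = sorted(blind_a[h] for h in common)
# → ['bob@example.com', 'eve@example.com']
```

The privacy concern raised above is visible here: both parties learn exactly which users are in the intersection, which is what the PSU-based framework avoids.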
Meanwhile, vertical federated learning is reported to be subject to backdoor attacks, which corrupt data from rogue agents during training, as well as inference-phase attacks, which manipulate test data. Unlike in traditional horizontal federated learning, however, boosting the resilience of vertical federated learning is difficult due to the lack of clear redundancy among the agents. Recently, scientists have proposed an advanced solution: robust vertical federated learning (RVFR). According to their paper under review at ICLR 2022, RVFR can restore the underlying uncorrupted features with verifiable guarantees under specific conditions, thereby decontaminating the model against a wide range of backdoor attacks. In addition, RVFR protects against adversarial inference-phase attacks and missing-feature attacks.
Other major challenges must be tackled before vertical federated learning sees mainstream adoption. At present, vertical federated learning algorithms require intricate training procedures and time-consuming cryptographic operations to maintain privacy, resulting in slow training of machine learning models at the edge. Moreover, there is a chance that cybercriminals could fish private data out of the gradients of machine learning parameters during vertical federated learning, which would amount to a catastrophic data breach. While homomorphic encryption is currently used to prevent this, much development remains to be done.
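To illustrate why gradients need protection, the sketch below uses pairwise additive masking, a simplified stand-in for the homomorphic encryption mentioned above: the aggregator learns only the sum of the parties' gradients, never an individual gradient. The gradient values and mask scheme are illustrative assumptions, not the protocol of any particular system.

```python
# Toy sketch: hiding individual gradients from the aggregator.
# Two parties add pairwise random masks that cancel on aggregation
# (a simplified stand-in for homomorphic encryption / secure aggregation).
import numpy as np

rng = np.random.default_rng(1)
grad_a = np.array([0.5, -1.0, 2.0])   # party A's true gradient (private)
grad_b = np.array([1.5, 0.0, -0.5])   # party B's true gradient (private)

mask = rng.normal(size=3)             # derived from a shared pairwise secret
sent_a = grad_a + mask                # what A actually transmits
sent_b = grad_b - mask                # what B actually transmits

aggregate = sent_a + sent_b           # masks cancel: equals grad_a + grad_b
```

The aggregator sees only `sent_a` and `sent_b`, which are statistically masked, yet still recovers the exact sum it needs for the model update.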