Data-centric AI is an emerging class of AI that focuses on “data” rather than the model. Machine learning (ML) techniques are generally model-centric; they assume a static environment within which the model operates. Most real-world AI applications, however, cannot rely on a static environment, because the processes built around a machine learning model require different processing and monitoring capabilities. To make AI applications more efficient and standardized, teams are shifting to a data-centric approach, or a combination of the two. The data-centric approach focuses on studying, analyzing, and utilizing data for decision-making.
Andrew Ng, a deep-learning pioneer and the founder of Landing AI, has become a vocal advocate of data-centric AI, as he believes everything comes down to data. If data is carefully prepared, organizations can accomplish the same goals with much less of it. To reach this stage, organizations must shift to a data-centric approach to reap the maximum benefits of artificial intelligence.
This shift in approach has pushed “Active Learning (AL)” forward as a way to reduce the manual effort of sampling and labeling data for ML models. Active learning shortlists the most representative data samples for training and sends them for labeling. Only the selected sub-datasets are fed into the model, which yields competitive results while saving labeling time and reducing training costs. While this approach saves manual data-handling time, users must build an extensive backend to run those active learning pipelines. The significant engineering and coding work involved makes applying active learning challenging.
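To make the select-label-retrain cycle concrete, the sketch below shows a generic active learning loop (not ALaaS code): a model trained on a small labeled seed set scores the unlabeled pool by least-confidence uncertainty, and the most uncertain samples are moved to the labeled set each round. The dataset and model are stand-ins chosen only to keep the example self-contained.

```python
# A minimal, generic active-learning loop: train on a small labeled seed set,
# score the unlabeled pool by least-confidence uncertainty, and "send" the
# most uncertain samples for labeling each round.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = np.arange(20)                # small seed set with known labels
pool = np.arange(20, len(X))           # "unlabeled" pool (labels hidden)

for round_idx in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)          # least-confidence score
    query = pool[np.argsort(-uncertainty)[:50]]    # 50 most uncertain samples
    labeled = np.concatenate([labeled, query])     # pretend they get labeled
    pool = np.setdiff1d(pool, query)
    print(f"round {round_idx}: labeled={len(labeled)}, "
          f"pool acc={model.score(X[pool], y[pool]):.3f}")
```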
To overcome the issue, the newly proposed system at the National University of Singapore, named Active-Learning-as-a-Service (ALaaS), runs multiple strategies on datasets and performs the desired tasks by building pipelines. The server-client architecture, data manager, and AL strategy zoo are the three major components of this framework.
The system adopts a “server-client” architecture to perform scheduled jobs, making it compatible with individual devices and clouds. This architecture abstracts the necessary algorithms into web-based services that users can consume directly. Users only need to follow the suggested guidelines when creating a configuration file with basic settings such as the dataset path and the desired techniques. They can then start the client and the server with only a few lines of code (LoCs).
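As a rough illustration of the server-client idea, the stdlib-only sketch below exposes sample selection as a small web service and queries it with a declarative config. The endpoint, payload shape, and option names are our own assumptions for the example, not the actual ALaaS interface.

```python
# Toy server-client sketch (standard library only), illustrating selection as
# a web service; this is NOT the ALaaS implementation.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ALHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        cfg = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # A real service would run the chosen strategy over cfg["dataset_path"];
        # here we just return dummy sample indices of the requested budget.
        body = json.dumps({"selected": list(range(cfg["budget"]))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("localhost", 8888), ALHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: a short declarative config plus a few lines of code.
config = {"dataset_path": "s3://bucket/unlabeled/",
          "strategy": "least_confidence", "budget": 5}
req = urllib.request.Request("http://localhost:8888/query",
                             data=json.dumps(config).encode(),
                             headers={"Content-Type": "application/json"})
print(json.loads(urllib.request.urlopen(req).read()))  # {'selected': [0, 1, 2, 3, 4]}
server.shutdown()
```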
Once a user uploads the dataset and starts the server, the data manager takes responsibility for that dataset. It stores the metadata and indexes the samples to avoid redundant data movement.
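A hypothetical sketch of that role, with names of our own choosing rather than ALaaS's, might look as follows: the manager records per-sample metadata and an ID-to-URI index, so later rounds can refer to samples by ID instead of re-transferring the underlying files.

```python
# Hypothetical data-manager sketch: keep metadata and an index over uploaded
# samples so AL strategies can work on sample IDs rather than raw bytes.
from dataclasses import dataclass, field

@dataclass
class DataManager:
    dataset_uri: str                                   # e.g. an S3 prefix or local folder
    index: dict = field(default_factory=dict)          # sample_id -> file URI
    metadata: dict = field(default_factory=dict)       # sample_id -> size, label status

    def register(self, sample_id: str, uri: str, size: int):
        self.index[sample_id] = uri
        self.metadata[sample_id] = {"size": size, "labeled": False}

    def unlabeled_ids(self):
        # Strategies rank IDs; only the finally selected files need to be fetched.
        return [s for s, m in self.metadata.items() if not m["labeled"]]

dm = DataManager("s3://my-bucket/unlabeled/")
dm.register("img_0001", "s3://my-bucket/unlabeled/img_0001.jpg", size=48213)
print(dm.unlabeled_ids())   # ['img_0001']
```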
Finally, the AL strategy zoo abstracts the desired strategies, like Bayesian, density-driven, batch selection, etc.
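The sketch below illustrates how such a zoo can abstract selection behind a common interface. For brevity it shows two uncertainty-based strategies (least confidence and margin sampling) rather than the Bayesian or density-driven ones, and the class names are our own, not those of the actual ALaaS zoo.

```python
# Illustrative strategy-zoo sketch: a shared interface plus two concrete
# uncertainty-based strategies registered under string names.
import numpy as np

class Strategy:
    def select(self, proba: np.ndarray, budget: int) -> np.ndarray:
        raise NotImplementedError

class LeastConfidence(Strategy):
    # Pick samples whose top predicted probability is lowest.
    def select(self, proba, budget):
        return np.argsort(proba.max(axis=1))[:budget]

class MarginSampling(Strategy):
    # Pick samples with the smallest gap between the top-2 class probabilities.
    def select(self, proba, budget):
        sorted_proba = np.sort(proba, axis=1)
        margin = sorted_proba[:, -1] - sorted_proba[:, -2]
        return np.argsort(margin)[:budget]

ZOO = {"least_confidence": LeastConfidence(), "margin": MarginSampling()}

proba = np.random.dirichlet(np.ones(10), size=1000)   # fake softmax outputs
picked = ZOO["least_confidence"].select(proba, budget=50)
```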
Besides the three main components, others, such as the model repository and serving engine, help automate the AL application by connecting to public model hubs like HuggingFace and TorchHub and by calling other ML serving backends for inference.
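For instance, a proxy model can be pulled from a public hub and used to score the unlabeled pool, as in the sketch below. It uses TorchHub's torch.hub.load entry point; the exact weights argument depends on the installed torchvision version, and running it downloads the model weights.

```python
# Sketch of what the model repository / serving engine enables: pull a proxy
# model from TorchHub and use its predictions to score unlabeled samples.
import torch

# Older torchvision versions use pretrained=True instead of the weights kwarg.
model = torch.hub.load("pytorch/vision", "resnet18", weights="IMAGENET1K_V1")
model.eval()

batch = torch.randn(8, 3, 224, 224)             # stand-in for a preprocessed batch
with torch.no_grad():
    proba = torch.softmax(model(batch), dim=1)  # per-class probabilities
uncertainty = 1.0 - proba.max(dim=1).values     # feed into an AL strategy
```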
With its smartly designed architecture and components, ALaaS promises three key improvements: efficiency, modularity, and accessibility.
ALaaS makes it much more convenient to apply active learning by providing optimizations such as pipeline generation and ML backend adoption. Because active learning typically operates on large-scale datasets and runs several computationally heavy deep learning (DL) models, efficient dataset processing, application development, and ML backend adoption are crucial.
ALaaS delivers a user-friendly experience by packaging active learning as containerized services, so even non-technical users can use it without getting into code details, making it highly accessible.
Active learning is advancing rapidly, mainly due to the development of deep learning techniques. Making active learning more accessible should not prevent professionals from using it for more complex projects. Thanks to ALaaS's modular design, professionals can quickly prototype, extend, and deploy state-of-the-art active learning approaches.
Such an MLOps system for data-centric AI, described in “Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI,” is a significant advancement in Machine-Learning-as-a-Service. More crucially, ALaaS integrates batching, caching, and stage-level parallelism (a pipeline approach) to increase the effectiveness of active learning operations. Results from the study also show that this design achieves lower latency and higher throughput than running individual jobs.
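To illustrate what stage-level parallelism buys, the toy sketch below (a generic producer-consumer pipeline, not ALaaS internals) overlaps a simulated download stage with a simulated inference/scoring stage via queues, so batch i+1 is fetched while batch i is being scored.

```python
# Generic illustration of stage-level parallelism: download and inference run
# in separate threads connected by queues, overlapping their work.
import queue
import threading
import time

downloaded, scored = queue.Queue(maxsize=4), queue.Queue()

def download_stage(n_batches):
    for i in range(n_batches):
        time.sleep(0.1)                  # pretend to fetch a batch of samples
        downloaded.put(f"batch-{i}")
    downloaded.put(None)                 # sentinel: no more work

def inference_stage():
    while (batch := downloaded.get()) is not None:
        time.sleep(0.1)                  # pretend to run the model and score
        scored.put((batch, "uncertainty-scores"))
    scored.put(None)

threads = [threading.Thread(target=download_stage, args=(8,)),
           threading.Thread(target=inference_stage)]
start = time.time()
for t in threads:
    t.start()
while (item := scored.get()) is not None:
    pass                                 # a selection stage would consume scores here
for t in threads:
    t.join()
print(f"pipelined run took {time.time() - start:.2f}s (vs ~1.6s sequential)")
```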