Researchers at Meta and Stanford University have developed a new metric for pruning AI training datasets. Under the power-law scaling that currently governs training, enormous amounts of additional data are needed to improve performance by even a few percentage points; the new metric aims to break that relationship and make training scale more efficiently.
Existing pruning techniques are either inefficient or severely compute-intensive. The new metric requires far less computation and is self-supervised, so it does not depend on labels.
Using statistical mechanics, the researchers showed that careful dataset pruning can make test error decay exponentially with dataset size rather than as a power law. Under exponential decay, much less additional data is needed to reach the same performance.
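The difference between the two regimes can be illustrated numerically. This sketch uses hypothetical parameter values (a power-law exponent of 0.5 and an exponential time constant of 1,000 examples, both assumed for illustration, not taken from the paper):

```python
import math

# Test error as a function of dataset size n under two assumed scaling laws.
def powerlaw_error(n, alpha=0.5):
    return n ** -alpha

def exponential_error(n, tau=1000.0):
    return math.exp(-n / tau)

n = 10_000

# Power law: halving the error requires MULTIPLYING the dataset size
# by 2**(1/alpha) = 4x, i.e. 30,000 extra examples at n = 10,000.
assert abs(powerlaw_error(4 * n) - 0.5 * powerlaw_error(n)) < 1e-12

# Exponential decay: halving the error needs only a fixed ADDITIVE
# increment of tau * ln(2) ~ 693 examples, no matter how large n is.
dn = 1000.0 * math.log(2)
assert abs(exponential_error(n + dn) - 0.5 * exponential_error(n)) < 1e-12
```

The multiplicative-versus-additive cost of each further improvement is why exponential scaling is so much cheaper in data at large dataset sizes.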
Meta’s researchers started by developing a theoretical model of data pruning in which each training example is assigned a ‘margin’: “easy” examples have a large margin, while “hard” examples have a small one.
To turn this into a practical metric, they applied k-means clustering in an embedding space and scored each example by its distance to the nearest cluster centroid: examples close to a centroid are “easy”, distant ones are “hard”. The best pruning strategy depended on the initial dataset size; with abundant data it is better to keep the hard examples, while with scarce data keeping the easy ones works best. They also concluded that as the dataset grows, an increasingly large fraction of it must be pruned to stay on the exponential-decay curve.
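A minimal sketch of this distance-to-centroid score, with assumed details: the centroid-fitting step is elided, and two fixed toy centroids stand in for the output of k-means on real embeddings.

```python
# Euclidean distance between two embedding vectors.
def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

# Pruning score: distance from an embedding to its nearest k-means
# centroid. Small score = prototypical ("easy"); large score = "hard".
def hardness(embedding, centroids):
    return min(dist(embedding, c) for c in centroids)

centroids = [[0.0, 0.0], [10.0, 10.0]]   # toy stand-in for k-means output
easy_example = [0.1, 0.1]                # close to a centroid
hard_example = [5.0, 5.0]                # far from every centroid
assert hardness(easy_example, centroids) < hardness(hard_example, centroids)

# Keep-hard pruning (the regime for abundant data): retain the examples
# with the largest scores.
data = [easy_example, hard_example]
kept = sorted(data, key=lambda e: hardness(e, centroids), reverse=True)[:1]
assert kept == [hard_example]
```

Because the score needs only embeddings and cluster centroids, no labels are involved, which is what makes the metric self-supervised.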
This is not the first time model scaling has been the focus of a research project. In 2020, OpenAI published research on performance trends in NLP models, which likewise identified dataset size as a key factor affecting model performance.