Hugging Face has released Datasets, a hub of ready-to-use datasets for machine learning models, aimed at the core challenges of NLP. The library contains 900 unique datasets, offers more than 25 metrics, and has more than 300 contributors. It is designed to support novel cross-dataset research by standardising the end-user interface and providing a lightweight frontend for internet-scale corpora.
In NLP use cases, datasets play a crucial role in evaluating and benchmarking results. While supervised datasets help in fine-tuning models, large unsupervised datasets are used for pretraining and language modelling. Practitioners face several challenges when dealing with different versions and documentation of datasets, the primary cause being a lack of uniformity.
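Benchmarking against a supervised dataset ultimately means comparing model predictions with gold labels from a held-out split. A minimal sketch of one such metric, accuracy (the predictions and labels below are made up for illustration; the Datasets library ships ready-made metric implementations):

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs vs. gold labels from a supervised test split
preds = [1, 0, 1, 1, 0]
golds = [1, 0, 0, 1, 0]
print(accuracy(preds, golds))  # 0.8
```

Standardising such metrics alongside the datasets themselves is what makes results comparable across papers and model versions.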
To address this, Hugging Face designed Datasets to tackle the challenges of dataset management while fostering a culture of community support. Because the library is built through community contributions, it covers a wide variety of languages and data types, including continuous values and multi-dimensional arrays for images, audio, and video.
Hugging Face's primary goals in building Datasets through public hackathons are:
- Each dataset in the library follows a standard tabular format, which facilitates proper versioning and citation. Downloading a dataset takes just one line of code.
- Large datasets can be streamed through the same interface, keeping computation and memory usage efficient while supporting tokenisation and featurisation.
- All datasets are tagged and documented with their usage, data types, and construction.
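The streaming point above can be illustrated with a plain-Python sketch: instead of materialising an entire corpus in memory, examples are yielded one at a time and tokenised on the fly. This generator stands in for the library's streaming interface and is not the actual Datasets API; the corpus and tokeniser are toy stand-ins.

```python
def stream_corpus(lines):
    """Yield raw examples one at a time instead of loading the whole corpus."""
    for line in lines:
        yield line

def tokenize(example):
    """Toy whitespace tokeniser applied on the fly to each streamed example."""
    return example.lower().split()

# Stand-in corpus; in practice examples would be read lazily from disk or the network
corpus = ["Hugging Face hosts many datasets", "Streaming keeps memory usage flat"]

# Tokenisation composes with the stream, so only one example is in memory at a time
tokenized_stream = (tokenize(ex) for ex in stream_corpus(corpus))
print(next(tokenized_stream))  # ['hugging', 'face', 'hosts', 'many', 'datasets']
```

Because tokenisation is chained onto the generator rather than applied to a fully loaded list, memory usage stays constant regardless of corpus size, which is the property that makes internet-scale corpora practical to work with.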
In a short time, the startup has seen a phenomenal rise thanks to its Transformers library, which is backed by TensorFlow and PyTorch. It provides many pre-trained models for text classification, information retrieval, and summarisation tasks. The library is extensively used by researchers at Google, Facebook, and Microsoft and has been downloaded a billion times.
With high-end technology concentrated in the hands of a few powerful companies, Hugging Face wants to democratise AI and extend the benefits of emerging technologies to smaller organisations. Hugging Face co-founder and CEO Clement Delangue believes there is a disconnect between research and engineering teams in NLP, and the company therefore aims to be the GitHub of machine learning.