The most challenging part of machine learning is data cleaning: on average, it takes up 70% of the time allotted to a project. There are now AutoML systems that can handle much of the remaining 30% of the work. But in getting this far, you have certainly made some assumptions; as the No Free Lunch Theorem suggests, a good model is always built on some assumptions. The question is whether you are aware of those beliefs. In this article, you will learn about some of the assumptions you may have made and their hidden costs.
Assumptions
The first one is ‘you have the data.’ Suppose you are building a facial recognition system; you cannot deploy open-sourced pre-trained models directly. You have to fine-tune the model for your local distribution. If the pre-trained model was trained on facial data sourced mostly from dark-skinned populations, no matter how accurate its predictions look on that data, it is bound to stumble when deployed on light-skinned people. Hence, it becomes paramount to collect local training data for fine-tuning.
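As a minimal sketch, assuming a PyTorch/torchvision workflow in which your locally collected faces sit in an ImageFolder directory (the path, backbone choice, and hyperparameters below are hypothetical placeholders), fine-tuning a pre-trained model on local data might look like this:

```python
# Minimal fine-tuning sketch (assumed PyTorch/torchvision setup; the path,
# backbone, and hyperparameters are illustrative, not a production recipe).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Locally collected faces, organised as local_faces/<identity>/<image>.jpg
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
local_data = datasets.ImageFolder("local_faces/", transform=transform)
loader = DataLoader(local_data, batch_size=32, shuffle=True)

# Start from a generic pre-trained backbone and swap in a new classification head
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False          # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, len(local_data.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):                   # a handful of epochs is often enough
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The key point is that only the new head is trained on the local distribution, so even a modest amount of carefully collected local data can correct the bias of the original training set.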
The second assumption is ‘you have enough data.’ This belief gets tested once you plot the training and test error. If your model is overfitting, you will almost certainly want to fetch more training data. However, large models require a significantly larger amount of it. How would you amass such a colossal volume of data? You have a few options: web scraping, obtaining open-source data from a similar distribution, and/or buying data from suppliers.
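A quick way to run that test is a learning curve. The sketch below assumes scikit-learn and matplotlib, and the estimator and dataset are stand-ins for your own model and data:

```python
# Learning-curve sketch (assumes scikit-learn and matplotlib; the classifier
# and dataset are illustrative stand-ins for your own model and data).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

# Convert accuracy to error so the gap between the curves is easy to read:
# a large, persistent gap suggests overfitting, while two high, converging
# error curves suggest you simply need more (or better) data.
train_err = 1 - train_scores.mean(axis=1)
val_err = 1 - val_scores.mean(axis=1)

plt.plot(sizes, train_err, "o-", label="training error")
plt.plot(sizes, val_err, "o-", label="validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
```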
Of all the assumptions, the most critical one is ‘you have ample resources to act on either of the above assumptions.’ You need trained people, sufficient computing power, and a budget that covers both. Frankly, there is always a trade-off between these three factors.
Hidden Costs
The worst part is that you may be unaware of the hidden costs of those assumptions. At the very beginning, we mentioned the time split of a machine learning project, but there was no mention of data labeling. What would you clean when you have no idea what target each instance carries? The same argument applies to the first assumption: when you do not have labels for your local data, having ‘THE’ local data for fine-tuning is of little use. Therefore, the first hidden cost is labeled data availability.
The second assumption highlights the issue of data sources and their underlying presumptions. ‘Garbage in, garbage out’ is a rule of thumb in machine learning. If you believe the Internet is your abundant source of data, think again: the many blunders recorded in the AI Incident Database will make you stay away from such an idea. Secondly, the labeling paradigm will differ if you use open-source datasets; are you going to relabel them with manual labor? Definitely not. And buying data will not give you any advantage over your competitors, because the seller is not restricted from striking the same deal with your enemy at the gate. Hence, the second hidden cost is data quality.
The problem with the third assumption is the trade-off between workforce, capital, and computing resources. Now ask yourself: how many MOOCs or courses include data labeling in their syllabus? Out of all the models you have built, for how many did you annotate the data yourself? How much time did you set aside for data labeling in your machine learning workflow? Thus, the last hidden cost is intent.
Solutions
By now, you should have a better understanding of the scenario you face as a data scientist or machine learning engineer. Let us now talk about the solution: a Training Data Platform (TDP). Startups and micro, small and medium enterprises (MSMEs) need not build in-house tools from scratch, freeing up investment to capitalize on other services and products. TDPs provide a one-stop solution from data collection to labeling; some even offer model training provisions.
Now, you can streamline your machine learning workflow in a single environment and save both money and time. You need not force your capable workforce to scramble around for fixes all day, and the intuitive UI of a TDP makes workforce training easy. The main mantra of TDPs is: annotate, manage, and iterate. TDPs offer automatic annotation that needs only a few well-annotated examples to label the rest. A reasonable TDP has collaboration built in, along with API support for integrating other software. Similarly, a TDP should be agile enough to iterate over new data batches to keep boosting the accuracy of its models. Here are some TDPs that have earned their place by scaling up to enterprise-grade data platforms: Labelbox, Appen, SAMA, Scale, Neuralmarker, Unidata, and more.
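To make the annotate-manage-iterate loop concrete, here is a generic model-assisted labeling sketch in Python. It is not tied to any particular TDP's SDK; the data, model, and confidence threshold are hypothetical. A small seed of human-labeled examples trains a model that pre-annotates the unlabeled pool, and only low-confidence items are routed back to human annotators for the next iteration:

```python
# Generic model-assisted labeling sketch (not any vendor's API; the data,
# model, and confidence threshold below are hypothetical placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression

def pre_label(seed_X, seed_y, pool_X, confidence=0.9):
    # Train on the small human-labeled seed set
    model = LogisticRegression(max_iter=1000).fit(seed_X, seed_y)
    probs = model.predict_proba(pool_X)
    labels = probs.argmax(axis=1)
    accepted = probs.max(axis=1) >= confidence      # auto-accept confident labels
    needs_review = np.where(~accepted)[0]           # queue the rest for humans
    return labels, accepted, needs_review

# Example: 200 labeled seed items, 10,000 unlabeled items in the pool
rng = np.random.default_rng(0)
seed_X, seed_y = rng.normal(size=(200, 16)), rng.integers(0, 2, 200)
pool_X = rng.normal(size=(10_000, 16))
labels, accepted, review_queue = pre_label(seed_X, seed_y, pool_X)
print(f"auto-labeled: {accepted.sum()}, sent for review: {len(review_queue)}")
```

Each pass through this loop shrinks the review queue as the model improves, which is essentially the iteration that a TDP automates behind its UI and APIs.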