Data-centric AI vs Model-centric AI: which one will win?
October 5, 2021
Artificial intelligence lifecycle: a short introduction
As you probably know, AI projects are made of 3 main parts. First comes the training data, which must be stored, managed, cleaned, and so on. That data is then used to train Deep Learning models, where we iteratively run experiments to optimize the performance metrics. Finally, when the results are good enough, we deploy the model to production so the business application can use it, either on the edge or in the cloud.
But once you are there, it’s not over...
Deep Learning models are created by learning from real-world data. But the world changes, so the data will change too during the lifetime of the model, meaning that models will have to be retrained.
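To make the retraining trigger concrete, here is a minimal sketch of a drift check on a single numeric feature. The function name `has_drifted`, the threshold of 3 standard errors, and the sample values are all assumptions made for illustration; real monitoring setups typically use richer statistical tests over many features.

```python
from statistics import mean, stdev


def has_drifted(train_values, live_values, threshold=3.0):
    """Return True when the live mean is far from the training mean.

    Hypothetical helper: the gap is measured in standard errors, assuming
    the spread seen at training time still applies to live data.
    """
    train_mean = mean(train_values)
    train_std = stdev(train_values)
    live_mean = mean(live_values)
    if train_std == 0:
        # Constant training feature: any change at all counts as drift.
        return live_mean != train_mean
    std_err = train_std / len(live_values) ** 0.5
    return abs(live_mean - train_mean) / std_err > threshold


# Illustrative values only, not taken from any real dataset.
train = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
print(has_drifted(train, [10, 9, 11, 10]))    # live data looks like training data
print(has_drifted(train, [15, 16, 14, 15]))   # live data has shifted upward
```

When a check like this fires, it is a signal that the model may need to be retrained on more recent data.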
As you may understand by now, an AI model's performance depends heavily on the data. That's why everyone is talking about "Data-centric AI" in 2021.
In this article, we will walk you through the history of Data-Centric AI and where it is heading over the coming months and years. 🚀
Model-centric AI - Those days are gone
The focus on the model described above is the approach that AI has followed for many years.
In the words of the renowned Andrew Ng, AI systems are composed of Code + Data, Code being the model, programmed using frameworks in Python, C++, R, etc. The challenge for research labs around the world was, for a given benchmark dataset such as the COCO dataset, to create a model architecture that would perform better and become the state of the art.
This is called a model-centric approach: keeping the data fixed and iterating over the model and its parameters to improve performance.
Of course, it was amazing for us ML engineers to have easy access to new and better models on GitHub and to be able to experiment to create the best model for our project. For a lot of Machine Learning engineers, it gave us the feeling that, after studying ML theory so hard, we were finally applying this science and creating something powerful.
What was particular about this period is that data collection was a one-off task, performed at the beginning of the project, perhaps with the goal of growing the dataset over time but with little thought given to its intrinsic quality.
The resulting models were usually deployed at a small scale: one server or device could handle all the load, and monitoring wasn’t really a thing.
But the biggest hurdle was that everything was done by hand: data cleaning, which is rather normal, but also model training, validation, deployment, storage, sharing, etc.
It was obvious that there was a problem and that some easy solutions could be provided to handle such issues, but the solutions were either nonexistent or, like huge ML platforms, too complicated for the majority of organizations to apply.
Data-centric AI - Now
Times have changed, and some influential people in the field, such as Dr. Andrew Ng, have started proposing new paradigms for model optimization, this time focusing more on the data.
This approach is now called Data-centric. You may have seen those words on a lot of startup websites, and they can have different meanings and applications, so I will start by introducing the concept.
A Data-centric approach is when you systematically change or enhance your datasets to improve the performance of the model. This means that, contrary to the model-centric approach, this time the model is fixed, and you only improve the data.
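As a toy illustration of "fixed model, improved data", the sketch below keeps the same 1-nearest-neighbour classifier and changes only one training label. The data points, the labels, and the helper names `predict_1nn` and `accuracy` are invented for this example; it is not a benchmark result.

```python
def predict_1nn(train, x):
    """Predict the label of the training point closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true one."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical 1-D dataset: the point at 3.0 was labeled "b" by mistake.
noisy_train = [(1.0, "a"), (2.0, "a"), (3.0, "b"),
               (8.0, "b"), (9.0, "b"), (10.0, "b")]
# Same data, same model; only that one label has been corrected.
clean_train = [(1.0, "a"), (2.0, "a"), (3.0, "a"),
               (8.0, "b"), (9.0, "b"), (10.0, "b")]
test = [(1.4, "a"), (2.9, "a"), (3.2, "a"), (8.4, "b"), (9.6, "b")]

print(accuracy(noisy_train, test))
print(accuracy(clean_train, test))
```

The model code is identical in both runs, so any difference in accuracy comes from the data alone.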
Enhancing the dataset can mean many things: taking care of the consistency of the labels, finely sampling the training data, and choosing the batches wisely, rather than always trying to increase the dataset size.
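One simple way to start checking label consistency is to flag every sample on which annotators disagree. This is a minimal sketch that assumes each sample was labeled by several annotators; the function name `find_inconsistent_labels` and the sample ids are hypothetical.

```python
from collections import Counter


def find_inconsistent_labels(annotations):
    """Return the samples whose annotators did not all agree.

    `annotations` maps a sample id to the list of labels it received.
    The result maps each disputed sample id to its label counts.
    """
    flagged = {}
    for sample_id, labels in annotations.items():
        counts = Counter(labels)
        if len(counts) > 1:
            flagged[sample_id] = dict(counts)
    return flagged


# Hypothetical annotations from three annotators.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "dog", "dog"],
}
print(find_inconsistent_labels(annotations))
```

The flagged samples can then be sent back for review instead of going into training with an ambiguous label.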
As an example of how models trained on benchmark datasets can be improved, a study showed that, on average, 3.4% of the data in those datasets was mislabeled (which can take a lot of different forms). Imagine the performance increase possible by just decreasing this number to 0!
But focusing this much on the data has a prerequisite. Deployed models can collect the data they make predictions on, so data should flow in continuously, and keeping up with it means that you have automated all the processes behind the model lifecycle, from training through validation to deployment.
This discipline is called MLOps (for Machine Learning Operations).
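To give an idea of what automating the lifecycle can look like, here is a minimal sketch of a single retraining step that chains training, validation, and deployment. Every name in it (`retrain_and_maybe_deploy`, the stub stages, the `min_score` gate) is a placeholder invented for illustration; a real pipeline would plug in actual training, evaluation, and deployment code.

```python
def retrain_and_maybe_deploy(data, train_fn, validate_fn, deploy_fn, min_score):
    """Train on fresh data and deploy only if validation passes.

    Returns True when the new model was deployed, False otherwise.
    """
    model = train_fn(data)
    score = validate_fn(model)
    if score >= min_score:
        deploy_fn(model)
        return True
    return False


# Stand-in stages, for illustration only.
deployed = []


def train_stub(data):
    # The "model" here is just the mean of the data.
    return sum(data) / len(data)


def validate_stub(model):
    # Pretend score: the closer the model is to 10, the better.
    return 1.0 - abs(model - 10) / 10


went_live = retrain_and_maybe_deploy(
    [9, 10, 11], train_stub, validate_stub, deployed.append, min_score=0.9
)
print(went_live, deployed)
```

The validation gate is the important design choice: new data can trigger retraining automatically, but a model only reaches production if it clears the agreed score.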
The first article of our series on MLOps will be available next Tuesday. We'll walk through all the MLOps concepts, starting from scratch.