Data-Centric AI vs. Model-Centric AI? Which one will win?
Today, we've gone from model-centric to data-centric AI. We will walk you through the history of data-centric AI and where it's going in the years ahead!
Picsellia Team
· 4 min read

Artificial Intelligence Lifecycle: A Brief Introduction
As you all know, AI projects are made of 3 main parts. The first one is the training data, which must be stored, managed, cleaned, etc. Then, the data is used to train Deep Learning models, for which we iteratively run experiments to optimize performance metrics. Finally, when the results are good enough, we deploy the model in production so the business application can use it, either at the edge or in the cloud.
But it's not over...
Deep Learning models learn from real-world data. But the world changes, so the data changes over the lifetime of the model. This is why models have to be retrained.
As you may understand by now, the AI model's performance is all about the data. That's why everyone has been talking about "Data-Centric AI" in 2021.
In this article, we will walk you through the history of Data-Centric AI and where it's going in the coming months and years!
Model-Centric AI: These days are gone
This focus on the model we mentioned above is the state AI stayed in for many years.
In the words of the renowned Andrew Ng, AI systems are composed of Code + Data, where Code is the model, programmed with frameworks in Python, C++, R, etc. The challenge for research labs around the world was, for a given benchmark dataset such as COCO, to create a model architecture that would perform better and become the new state of the art.
This is called a model-centric approach: keeping the data fixed and iterating over the model and its parameters to improve performance.
Sure, it was amazing for us ML engineers to have easy access to new and better models on GitHub and to be able to pick the best model for our project. After studying ML theory so hard, many of us finally felt we were applying that science and building something powerful.
The particularity of this period was that data collection was a one-off task, performed at the beginning of the project. The dataset might grow over time, but little thought was given to its inner quality.
Models were usually deployed at a small scale: a single server or device could handle the whole load, and monitoring wasn't really a thing.
But the biggest hurdle was that everything was done manually: data cleaning (fair enough), model training, validation, deployment, storage, sharing, and more.
It was obvious there was a problem to solve. However, at that time, solutions such as big ML platforms were either nonexistent or too complicated for most organizations to adopt.
From Model-Centric to Data-Centric AI
Times have changed, and influential people in the field, such as Dr. Andrew Ng, started proposing new paradigms for model optimization, this time by focusing on the data.
This approach is now called data-centric. You may have seen these words on a lot of startup websites, and they can have different meanings and applications, but let's start with the concept itself.
A data-centric approach means systematically changing or enhancing your datasets to improve the performance of the model. Contrary to the model-centric approach, this time the model is fixed, and you only improve the data. Enhancing the dataset can mean different things: ensuring label consistency, finely sampling the training data, choosing batches wisely. It does not always mean growing the dataset.
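The fixed-model, improved-data loop can be sketched in a few lines. This is a toy illustration, not an actual pipeline: it assumes scikit-learn, simulates "improving the data" by repairing injected label noise, and keeps the model and its hyperparameters identical across both data versions.

```python
# Minimal sketch of a data-centric iteration: the model and its
# hyperparameters stay fixed; only the training labels change.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Data version 1: labels with ~20% simulated annotation errors.
noisy = y_tr.copy()
flip = rng.choice(len(noisy), size=len(noisy) // 5, replace=False)
noisy[flip] = 1 - noisy[flip]

def train_and_score(labels):
    # Same architecture, same hyperparameters, every iteration.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, labels)
    return model.score(X_te, y_te)

acc_noisy = train_and_score(noisy)  # data version 1: raw labels
acc_clean = train_and_score(y_tr)   # data version 2: labels reviewed
print(f"noisy labels: {acc_noisy:.3f}  cleaned labels: {acc_clean:.3f}")
```

In a real project, "data version 2" would come from label review, smarter sampling, or targeted collection rather than from knowing the ground truth.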
As an example of how much room for improvement there is, a study of popular benchmark datasets found that, on average, 3.4% of their data was mislabeled (errors that take many different forms). Imagine the performance gain from bringing that number down to zero!
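One common way such mislabeled examples are surfaced is confident-learning-style filtering: score every sample with out-of-fold predictions and flag those whose given label receives a very low probability. The sketch below is an illustration on synthetic data; the 0.2 threshold and the injected 3% noise are assumptions for the demo, not values from the study.

```python
# Hedged sketch: flag samples where a cross-validated model assigns
# low probability to their given label (the confident-learning intuition).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
flipped = rng.choice(len(y), size=30, replace=False)  # inject ~3% label errors
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-fold probabilities: each sample is scored by a model
# that never trained on it.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")
given_label_prob = probs[np.arange(len(y_noisy)), y_noisy]
suspects = np.where(given_label_prob < 0.2)[0]  # candidates for human review
print(f"{len(suspects)} samples flagged for relabeling review")
```

Flagged samples then go to annotators for review rather than being dropped automatically, since some are simply hard examples rather than labeling mistakes.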
But focusing that heavily on data only works if it flows continuously: deployed models collect the data they run predictions on, which means every process behind the model lifecycle, from training through validation to deployment, has to be automated.
This discipline is called MLOps (Machine Learning Operations). If you'd like to learn more about MLOps and its key concepts, check out the first article of our series here.