VLMs vs. CNNs: Is a New Era Dawning in Computer Vision Performance?


The meteoric rise of large language models (LLMs) like ChatGPT and GPT-4 captured our imagination with their ability to understand and generate human-like text in natural language processing (NLP). This advancement led to a surge of excitement and speculation about the potential of their visual counterparts, vision language models (VLMs), to revolutionize computer vision by processing textual and visual information together. Given this ability, it is tempting to conclude that VLMs are better than conventional CNNs for computer vision tasks.

Despite this growing speculation, crucial questions remain: Are VLMs truly the silver bullet for all computer vision tasks, or does the speculation overstate their abilities and benefits?

In this article, we explore the fundamental differences between CNNs and VLMs. We then carry out a practical comparison and in-depth analysis of the performance of these two model types on two computer vision tasks to assess whether VLMs are ushering in a new era in computer vision or if CNNs are still very relevant.

From Convolution to VLMs: Have VLMs Disrupted the Status Quo of CNNs in Computer Vision Applications?

TL;DR: The CNN vs. VLM debate starts with the differences in their architectural designs. CNNs use a strong inductive bias for spatial information to capture local patterns and progressively build up to more abstract representations. This design often excels at tasks that require precise localization and fine-grained detail. VLMs, on the other hand, incorporate vision transformers (ViTs), allowing them to model complex relationships and leverage multi-modal information, which makes them better suited for tasks that require a deeper understanding of image content and its connection to language.

CNNs rely on convolutional layers to extract hierarchical features from images. Their built-in inductive bias helps them learn spatial relationships: each convolutional layer slides small filters across the grid-like image, so the network first captures local patterns and then progressively builds up to more abstract representations. This approach has proven highly effective for tasks like image classification and object detection, where understanding spatial relationships is key.
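
To make the filter-scanning idea concrete, here is a minimal sketch of a single convolution in plain Python (stride 1, no padding). The toy image and the hand-written vertical-edge kernel are assumptions chosen for illustration; in a real CNN the kernel weights are learned and the operation runs in an optimized library.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output value only sees a kh x kw neighborhood of the input:
            # this locality is the spatial inductive bias described above.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2 (illustrative only).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical hand-written vertical-edge filter; a CNN would learn such weights.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The filter responds only where the edge sits, and the same weights are reused at every position. Stacking such layers is what lets a CNN grow from local patterns to more abstract features.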

Conversely, VLMs often incorporate vision transformers (ViTs), which extend the transformer architecture to image processing by dividing images into patches and treating them as sequences of tokens. This allows VLMs to model complex connections between textual and visual data, potentially capturing global context more effectively than CNNs. This makes them well-suited for tasks that require understanding the semantic meaning of images and their connection to language, such as image captioning and visual question answering.
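
The patch-to-token step can be sketched as follows. This is a simplified illustration on a single-channel toy image; the function name and patch size are assumptions, and a real ViT would additionally apply a learned linear projection and position embeddings to each flattened patch.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W image into non-overlapping patches and flatten each into a token."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one vector (one "token").
            token = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            tokens.append(token)
    return tokens


# Toy 4x4 image whose pixel values are 0..15 (illustrative only).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
print(len(tokens), "tokens of length", len(tokens[0]))
print(tokens[0])
```

A 4x4 image with 2x2 patches yields a sequence of four tokens, which the transformer then processes much like a sequence of word tokens.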

ViTs also help exchange information quickly between distant regions of an image. Consider an image where you need more than one piece of information to understand what's going on, and those pieces are scattered throughout the scene: self-attention lets every patch attend to every other patch within a single layer.
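
As a rough sketch of how distant patches exchange information, the snippet below implements single-head scaled dot-product self-attention in plain Python. It assumes Q = K = V = the raw tokens (no learned projections), and the four toy tokens are made up; it is meant to show the mechanism, not a production ViT layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        attn = softmax(scores)
        weights.append(attn)
        # The output is a weighted mix of all tokens.
        outputs.append(
            [sum(w * value[d] for w, value in zip(attn, tokens)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical patch tokens: the first and last are similar but spatially far apart.
tokens = [
    [1.0, 0.0],  # e.g. top-left patch
    [0.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],  # e.g. bottom-right patch
]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

The first token gives the last token as much weight as it gives itself, regardless of how far apart the two patches are in the image. A convolution would need several stacked layers before those two regions could influence each other.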

The CNN or VLM value propositions based on their architectural designs present different implications in terms of:

  • Compute Resource 
  • Scalability 
  • Maturity and extensive research

Compute Resource

The primary way to improve VLMs is to make them bigger. Because of the ViT architecture, they need large amounts of training data to reach a good initial performance level. As a result, VLMs often require more training effort and more computing resources, for both training and deployment, to perform on par with CNNs.


Scalability

CNNs can be trained and deployed more efficiently than VLMs, especially for large-scale datasets. This is due to their simpler architecture and the ability to leverage parallel processing hardware for faster computations.

Maturity and Extensive Research

CNNs have a long history of research and development, with a vast amount of literature, pre-trained models, and established best practices available. This maturity makes them a reliable choice for many computer vision tasks. VLMs, while rapidly advancing, are still a relatively new field with ongoing research and development.

After viewing these models' value propositions and implications through the lens of their core architectural designs, it is not easy to conclude that VLMs are entirely a better option than CNNs. However, their architectural differences already form the basis of their potential performance in practice on different computer vision tasks. Using a practical case study, let’s further analyze their differences, benefits, and implications.

Case Study

We want to integrate classification and object detection models into our computer vision platform to enable automatic labeling and annotation. The system runs on an NVIDIA H100 GPU. However, we are unsure whether a CNN or a VLM is optimal for our use case and system specification.

The aim is to conduct a comprehensive comparative analysis and explore each approach's advantages and limitations. To achieve this, we assess their capabilities based on the following criteria: accuracy (i.e., classification and detection), inference time, ability to generalize to unseen data, and practicality for integration with expanding computer vision platforms.

Classification Task

For the classification task, we use a dataset of tire images labeled as defective or good to determine how accurately the models classify tire condition.


We train a YOLOv8 model, our CNN baseline, on the dataset. After training this model for 100 epochs, the results show a significant improvement in classifying tires as defective or good: the model correctly classifies good tires 80% of the time and defective tires 82% of the time.

Performance Table

| Epochs | Validation Accuracy | Test Loss                                       |
|--------|---------------------|-------------------------------------------------|
| 100    | 0.871               | Decreases until it stabilizes after 80 epochs   |
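
Per-class figures like the ones quoted above and an overall accuracy are computed differently, so it helps to see both side by side. The sketch below uses hypothetical labels, not the article's data, and the helper names are assumptions.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground truth and predictions (NOT the tire dataset results).
y_true = ["good"] * 5 + ["defective"] * 5
y_pred = [
    "good", "good", "good", "good", "defective",
    "defective", "defective", "defective", "good", "good",
]

print(per_class_accuracy(y_true, y_pred))
print(overall_accuracy(y_true, y_pred))
```

Reporting both is useful on imbalanced data, where a high overall accuracy can hide poor performance on the rarer class.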


We evaluated several VLM configurations on the tire dataset, before and after fine-tuning: LLAVA v1.5, LLAVA v1.5 LoRA, LLAVA v1.5 LoRA Finetuned, MiniGPT v2, and MiniGPT v2 Finetuned.

Model Performance Table

| Model                     | Test Set Accuracy | Average Inference Time (seconds) |
|---------------------------|-------------------|----------------------------------|
| LLAVA v1.5                | 0.704             | 0.3                              |
| LLAVA v1.5 LoRA           | 0.615             | 0.2                              |
| LLAVA v1.5 LoRA Finetuned | 0.615             | 0.2                              |
| MiniGPT v2                | 0.577             | 0.1                              |
| MiniGPT v2 Finetuned      | 0.959             | 0.1                              |

The table shows that the MiniGPT v2 model significantly improved its accuracy after fine-tuning, increasing from 0.577 to 0.959 while keeping its inference time at 0.1 seconds. On the other hand, the LLAVA v1.5 LoRA model did not improve after fine-tuning: its accuracy stayed at 0.615, below the 0.704 of the base LLAVA v1.5, although its inference time (0.2 seconds) was slightly lower than the base model's 0.3 seconds.
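
The comparison can be reproduced with a few lines of arithmetic on the table values. The dictionary layout and the images-per-second figure are assumptions for illustration; the latter simply inverts the average inference time and assumes one image is processed at a time.

```python
# (test set accuracy, average inference time in seconds), copied from the table.
results = {
    "LLAVA v1.5": (0.704, 0.3),
    "LLAVA v1.5 LoRA": (0.615, 0.2),
    "LLAVA v1.5 LoRA Finetuned": (0.615, 0.2),
    "MiniGPT v2": (0.577, 0.1),
    "MiniGPT v2 Finetuned": (0.959, 0.1),
}

# Model with the highest test set accuracy.
best = max(results, key=lambda name: results[name][0])

# Accuracy gained by fine-tuning MiniGPT v2.
gain = results["MiniGPT v2 Finetuned"][0] - results["MiniGPT v2"][0]

# Rough throughput, assuming sequential single-image inference.
throughput = {name: 1.0 / seconds for name, (_, seconds) in results.items()}

print(f"best model: {best}")
print(f"MiniGPT v2 fine-tuning gain: {gain:.3f}")
for name, images_per_second in throughput.items():
    print(f"{name}: {images_per_second:.1f} images/s")
```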

Result Summary 

At the end of the experiment, the fine-tuned MiniGPT v2 was the best-performing model, with a test set accuracy of 0.959 (95.9%) and a fast inference time. However, the YOLOv8 model also performs well, with a validation accuracy of about 0.87. This result suggests that CNNs remain competitive, especially when optimized for specific tasks and trained over an adequate number of epochs.

Considering these factors, VLMs can potentially improve outcomes beyond standard CNNs, especially after a targeted fine-tuning process. Nevertheless, the choice between a CNN like YOLOv8 and a VLM like MiniGPT v2 depends on the specific application context, the complexity of the dataset, and the resources available for training and fine-tuning.

Object Detection

We use a vineyard “wine grapes” dataset for the object detection task. The goal is to detect grape clusters within each image. We measure how accurately these models can identify and locate grape clusters in a natural and varied environment, which is crucial for agricultural management and automated harvesting applications.


The results obtained after training the YOLOv8 model on the "Wine Grape" dataset for only 4 epochs are promising. The confidence scores for detecting grape clusters range between 0.3 and 0.9, with an average of around 0.6, which is encouraging for a model this early in its learning phase. This suggests the model can identify grape clusters in diverse vineyard conditions.

  • The box loss on the validation set (val_box_loss) decreases steadily, suggesting that the model predicts object locations more accurately as training progresses.
  • The mean average precision at an intersection over union (IoU) threshold of 0.5 (mAP50) shows an upward trend, meaning the model's predicted boxes increasingly overlap with the actual annotations.
  • The recall metric also increases, suggesting the model misses fewer relevant objects in the image.
  • The precision remains relatively stable, indicating that the number of false positives does not significantly increase during the early stages of training.

These results indicate that the model has good learning potential, even with few epochs. The consistency of its predictions and the rarity of missed boxes demonstrate YOLOv8's advanced capabilities for object detection in natural environments.
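
For readers less familiar with the detection metrics discussed above, the following sketch computes IoU, precision, and recall at an IoU threshold of 0.5 in plain Python. The boxes are made up, and the greedy matching is a simplification: mAP50 as reported by detection toolkits additionally averages precision over recall levels and classes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions (highest confidence first) to unmatched ground-truth boxes."""
    matched = set()
    true_positives = 0
    for box, _confidence in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical annotations and detections (NOT taken from the grape dataset).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    ((0, 0, 10, 8), 0.9),     # overlaps the first annotation
    ((50, 50, 60, 60), 0.6),  # overlaps nothing: a false positive
]

precision, recall = precision_recall(predictions, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of two predictions is correct and one of two annotated clusters is found. Recall rises as fewer clusters are missed, and precision falls as false positives accumulate.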


MiniGPT-v2 and LLAVA v1.5 show the most promising results among all the fine-tuned VLMs. 


Before fine-tuning, LLAVA v1.5 was inconsistent in detecting grape clusters, with missed detections and sometimes imprecise identification of areas of interest. These errors can be attributed to the model's insufficient generalization or difficulty distinguishing grape clusters in a complex environment like a vineyard.

The results after fine-tuning LLAVA v1.5 show that, despite improvement, the model still struggles to achieve a high level of accuracy in detecting grape clusters. Even after extensive training, the model still fails to consistently detect and identify relevant regions, suggesting that it has certain limitations.

The fact that the model performs better on the training set but not on the test set raises concerns about its generalization capability. This suggests that the model might be overfitting or memorizing specific details of the training data instead of understanding the general characteristics needed to perform well on new, unseen data.

These observations indicate that LLAVA v1.5, although an advanced visual language model, may not be suitable for object detection in highly variable scenarios without additional adjustments or exploration of alternative training techniques.


For MiniGPT-v2, applied to the same "Wine Grape" dataset, the results also show significant room for improvement. Before fine-tuning, the model made some accurate detections, but the dimensions of its bounding boxes were inconsistent and it had difficulty consistently identifying the region of interest.

After fine-tuning, the localization of grape clusters improves, indicating a better understanding of relevant visual characteristics. However, we observed disparities in the size of detections, which could lead to quantification errors in practical applications. Additionally, the inference time varies significantly, which poses a problem for real-time applications.

The presence of variations in inference time, even after fine-tuning, may indicate the model's volatility in the face of intrinsic variations in the test images. This raises questions about MiniGPT-v2's robustness and reliability for use in real-world conditions, where consistency and predictability are essential.
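
One way to quantify this kind of variability is to time each prediction and report the spread alongside the mean. The sketch below uses only the standard library; `fake_predict` is a hypothetical stand-in for a model call, and for GPU-backed models one would typically also synchronize the device before reading the clock.

```python
import statistics
import time


def measure_latency(predict, inputs, warmup=1):
    """Time predict(item) for each input, after a few untimed warm-up calls."""
    for item in inputs[:warmup]:
        predict(item)
    timings = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Mean, standard deviation, worst case, and coefficient of variation."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return {
        "mean": mean,
        "stdev": stdev,
        "max": max(timings),
        "cv": stdev / mean if mean else 0.0,
    }


def fake_predict(n):
    """Hypothetical stand-in for a model call; cost grows with n."""
    return sum(range(n))


timings = measure_latency(fake_predict, [1000, 2000, 50000, 1000])
summary = summarize(timings)
print({key: f"{value:.6f}" for key, value in summary.items()})
```

A high coefficient of variation, or a worst case far above the mean, is the kind of signal that matters for real-time use, where the slowest predictions set the latency budget.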

Result Summary

  • YOLOv8 proves promising, even after training for only 4 epochs, with metrics indicating an increasing ability to detect grape clusters. The validation loss and the mean average precision at an IoU threshold of 0.5 (mAP50) suggest that YOLOv8 could become highly effective for detection in complex natural environments with extended training.
  • LLAVA v1.5, a VLM, improved after fine-tuning but struggled with accuracy and generalization. This suggests the model could benefit from more diverse training data and special attention to avoiding overfitting. The notable difference between performance on the training and test sets indicates a limitation in the model's ability to generalize beyond the training data.
  • MiniGPT-v2 also improved after fine-tuning, especially in locating grape clusters. However, there remains variability in the size of detections and in inference time that could impact its real-time application, raising concerns about its reliability for practical applications in precision agriculture.

When you implement VLMs or CNNs in practice, factors such as compute, scalability, and maturity help determine which is the better option for your use case. The classification and object detection tasks in our case study show how their strengths and weaknesses vary with these factors. The choice between CNNs and VLMs therefore depends on the specific task, available resources, and desired trade-offs. We hope this analysis offers you valuable insights into computer vision and guides your decisions when choosing between VLMs and CNNs. Try Picsellia for your computer vision solutions.
