AI has contributed immensely to the advances in computer vision over the past few years; it is the field where most AI research is being conducted. But what is happening in 2022, and what are the most important current research trends?
For this article we researched and studied the most important trends so you don't have to. We have condensed them into one piece, so you need look no further!
Multimodal Learning
The ultimate goal of artificial intelligence research is to achieve Artificial General Intelligence (AGI): AI that is capable of understanding and performing any intellectual task, much like a human being. This is in contrast to current AI models, which are "narrow" in the sense that they are trained to perform just one task or a few very similar tasks.
Multimodal Learning is considered by many to be a step towards AGI. Multimodal models are able to process multiple types of data. Processing and capturing information from several different sources (e.g. images, text, audio, sensor data) allows an AI agent to build a richer conceptual understanding and perceive its surroundings more holistically.
Imagine the task of sentiment analysis. You can train a model to detect emotion from face images. Alternatively, you could train a model on the sentiment of written phrases, or on the tone of voice in audio recordings. Training on all three simultaneously, however, allows the model to capture emotion cues from every source: the facial expression, the tone of voice, and the sentiment of the phrase being spoken.
A lot of research has focused on Multimodal Learning over the last couple of years, and some exceptional models have been developed that achieve SOTA performance on multiple tasks without fine-tuning (zero-shot learning). Prominent examples include OpenAI's CLIP [1], which was trained to model the similarity between images and their associated captions, as well as Meta's FLAVA [2] and data2vec. Another thing these models have in common is that they rely on Self-Supervised Learning, which brings us to our next research trend.
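To make the idea concrete, here is a minimal sketch of zero-shot image classification with CLIP, assuming the Hugging Face transformers library and one of its publicly released checkpoints; the image path and candidate labels are purely illustrative.

```python
# Hedged sketch: zero-shot classification with CLIP via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # any local image (illustrative path)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores turned into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

No task-specific fine-tuning is involved: the class "head" is just a set of natural-language captions compared against the image in CLIP's shared embedding space.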
Self-Supervised Learning
Yann LeCun, VP and Chief AI Scientist at Meta AI, has called self-supervised learning the "dark matter of intelligence" [3]. Meta, along with other leading AI institutions, is working hard on self-supervision as a new learning paradigm that aims to replace classic Supervised Learning, which has become difficult to scale.
Self-Supervised Learning (SSL) is a learning paradigm in which pseudo-labels are generated automatically from unlabeled data. These labels are not tied to the target task but rather to a general pre-training task that aims to give the model "a general intuition about the world". By training on this task with orders of magnitude more data than supervised learning allows, models are able to capture features that would otherwise be difficult to discover. SSL offers an intelligent way of exploiting the unprecedented amounts of unlabeled data that are available today.
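As an illustration, the sketch below uses a classic vision pretext task, rotation prediction, where pseudo-labels are created by rotating unlabeled images and asking the model to predict which rotation was applied; the tiny backbone and hyperparameters are placeholders, not a recipe from any particular paper.

```python
# Minimal sketch of a self-supervised pretext task (rotation prediction) in PyTorch.
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """Create pseudo-labelled data: rotate each image by 0/90/180/270 degrees."""
    rotated, labels = [], []
    for k in range(4):  # 4 possible rotations -> 4 pseudo-classes
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

backbone = nn.Sequential(  # stand-in for any vision encoder
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 4)  # predicts which rotation was applied
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)  # a batch of *unlabeled* images
inputs, pseudo_labels = make_rotation_batch(images)
loss = criterion(head(backbone(inputs)), pseudo_labels)
loss.backward()
optimizer.step()
```

The point is that no human annotation is needed: the labels come for free from the transformation itself, and the backbone learned this way can later be fine-tuned on a small labeled set for the actual target task.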
All the contemporary huge NLP models, like GPT-3, BERT and the more recent BLOOM, rely on self-supervision to be trained successfully. SSL revolutionized NLP and is now revolutionizing the way we train vision models. But it's not just huge models that benefit: read our article to learn how you can reduce your training data needs by exploiting self-supervised learning.
Text-to-Image with Diffusion Models
Probably the most popular trend of 2022 is text-to-image diffusion models. Models such as DALL-E 2 and Imagen have been making headlines across the news. In case you haven't already heard, such models are able to generate original images from nothing more than natural language sentences!
AI-generated images from human phrase prompts.
At a high level, these models have two key parts:
- A powerful semantic text encoder. This can be either a multimodal text encoder trained on image-text pairs, such as CLIP's, or a large language model such as BERT. The text encoder is responsible for capturing the complexity and semantic meaning of an arbitrary input sentence. It captures these features by projecting the text sequence into a high-dimensional embedding space.
- A diffusion model that generates images from Gaussian noise. Through an iterative denoising procedure, diffusion models are able to generate novel images from pure noise! The denoising procedure is modeled as a Markov chain whose steps can be guided by a prior.
Since we want the diffusion model to create images inspired by a prompt phrase, the output of the text encoder is fed into the diffusion model as an input, together with the Gaussian noise, in order to guide the denoising process. The results speak for themselves!
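The sketch below is a purely structural illustration of how these pieces fit together, not any published model: a prompt embedding and pure Gaussian noise are fed to a denoiser that is applied iteratively. The toy encoder, the linear "denoiser" and the crude update rule are placeholders standing in for a real text encoder, U-Net and sampling schedule.

```python
# Conceptual sketch of text-conditioned iterative denoising (toy, untrained components).
import torch
import torch.nn as nn

text_encoder = nn.Embedding(10000, 512)  # stand-in for a CLIP/BERT text encoder
denoiser = nn.Linear(512 + 3 * 64 * 64 + 1, 3 * 64 * 64)  # stand-in for a U-Net

def generate(prompt_tokens, steps=50):
    text_emb = text_encoder(prompt_tokens).mean(dim=0)  # (512,) prompt embedding
    x = torch.randn(3 * 64 * 64)                         # start from pure Gaussian noise
    for t in range(steps, 0, -1):
        step = torch.tensor([t / steps])                 # normalized timestep
        # Predict the noise at step t, conditioned on the prompt embedding.
        pred_noise = denoiser(torch.cat([text_emb, x, step]))
        x = x - pred_noise / steps                       # crude denoising update (not a real DDPM step)
    return x.view(3, 64, 64)

image = generate(torch.randint(0, 10000, (6,)))  # "prompt" as random token ids
```

A real system replaces every placeholder with trained components and a proper noise schedule, but the data flow, text embedding plus noise in, progressively denoised image out, is the same.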
These models are the deepest AI has reached into art so far. Many fear that such models will replace digital illustrators, painters and eventually other artists as well. However, they are more likely to serve as excellent support tools for these professionals and to further democratize the field, rather than replace them completely.
3D Scene Perception
Perception in 3D space is a prerequisite for building autonomous robotic systems that operate in real-world conditions, such as autonomous vehicles, as well as for augmented reality applications. Relevant methods include 3D object detection, 3D panoptic segmentation, 3D depth estimation and many more.
The goal of 3D object detection is to recognize objects by drawing an oriented 3D bounding box (cuboid) around each one and assigning a classification label to it. By detecting objects in 3D we obtain information about their size, distance and orientation. This information can then be exploited by a navigation model to predict the motion of objects in the scene, facilitate path planning and ultimately avoid collisions.
A labeled 3D object detection example from the Objectron dataset. Source
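For illustration, a common way to parameterize such a box is a center, a size and a yaw angle around the vertical axis. The minimal sketch below uses that parameterization with illustrative field names; it is not the format of any specific dataset or detection API.

```python
# Minimal sketch of an oriented 3D bounding box (center + size + yaw parameterization).
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    center: np.ndarray  # (x, y, z) in metres
    size: np.ndarray    # (length, width, height) in metres
    yaw: float          # rotation around the up-axis, in radians
    label: str          # e.g. "car", "pedestrian"

    def corners(self) -> np.ndarray:
        """Return the 8 corner points of the box in world coordinates."""
        l, w, h = self.size
        # Corners in the box's local frame, centred at the origin.
        x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
        y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2
        z = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * h / 2
        local = np.stack([x, y, z])                          # (3, 8)
        c, s = np.cos(self.yaw), np.sin(self.yaw)
        rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])   # yaw rotation matrix
        return (rot @ local).T + self.center                 # (8, 3)

box = Box3D(np.array([10.0, 2.0, 0.9]), np.array([4.5, 1.8, 1.6]), 0.3, "car")
print(box.corners().shape)  # (8, 3)
```

A 3D detector effectively regresses these seven-plus numbers (center, size, yaw, class) per object instead of the four numbers of a 2D box.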
Deep learning methods, such as CNNs and in some cases ViTs, are used extensively to solve this task. Currently the most popular approaches use either images alone, LiDAR sensor data alone, or a fusion of both sources to create multimodal datasets.
Although a lot of research is being devoted to reliable 3D perception systems, and the automotive industry shows great interest in this technology, 3D object detection is still at an early stage. However, results on benchmark datasets (e.g. KITTI, SUN RGB-D, Objectron) are very promising.
3D Scene Representations with Neural Radiance Fields (NeRF)
In 2020, a paper called "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" [4] was published. Since then, it has inspired a large number of research papers on 3D scene view synthesis, and research is still ongoing. It is a technology with very promising applications in computer graphics, augmented reality and possibly medicine.
The goal of this task is to synthesize novel views of a scene given a 5D input. The inputs are the spatial coordinates x = (x, y, z) of a point in the scene and the viewing direction (θ, φ). NeRF maps this 5D input to a single volume density σ and a view-dependent RGB color c. It achieves this with an optimized deep fully-connected neural network. Training follows a supervised paradigm, hence multiple views of the scene are needed for training purposes.
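A minimal sketch of such a network is shown below, assuming only the simplest setting described above: an MLP that maps a 3D point plus a 2D viewing direction to a density and a view-dependent color. Positional encoding and volume rendering along camera rays, which the full method also requires, are omitted, and the layer sizes are illustrative.

```python
# Minimal NeRF-style MLP: 5D input -> (view-dependent RGB, volume density).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(  # processes the 3D point
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # sigma depends on position only
        self.color_head = nn.Sequential(          # color also depends on viewing direction
            nn.Linear(hidden + 2, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        feat = self.trunk(xyz)
        sigma = torch.relu(self.density_head(feat))  # non-negative volume density
        rgb = self.color_head(torch.cat([feat, view_dir], dim=-1))
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.rand(1024, 3), torch.rand(1024, 2))  # 1024 sampled points
```

During training, such per-point predictions are composited along camera rays and compared against the pixels of the known views, which is where the supervision comes from.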
NeRF achieved excellent results compared with the techniques that preceded it, sparking a small revolution in its field.
Synthesizing a novel view of an input 2D image with NeRF and other preceding methods. Source
Explainable AI and Uncertainty Quantification in CV
Finally, a trend that is becoming prevalent not just in computer vision but across the entire field of AI.
All the amazing deep learning techniques we use today have moved the needle of computer vision's state of the art by a huge margin. However, as models become larger and more complex, it becomes progressively harder to interpret why and how they make their predictions.
Deep learning has always been considered a black-box technology. However, as DL models make their way into our everyday lives, often governing important decisions such as medical diagnoses, it is of paramount importance to know WHY a model made a decision. It is also extremely important to know how confident a model is in its prediction and how much trust we can place in it. This motivation has given rise to the field of Explainable Artificial Intelligence (XAI).
A lot of funding is being allocated to explainable AI, and computer vision claims a large share of it. Some popular methods for creating more transparent models include CAM, Grad-CAM++ [6], RISE [5] and the SHAP Gradient Explainer. Related software tools and libraries include ELI5, InterpretML, tootorch, tf-explain and shap.
The Grad-CAM++ algorithm applied to a CNN. Colors represent filter activations: a hotter color means the model placed more emphasis on those pixels.
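As a concrete example, the sketch below implements the basic Grad-CAM recipe [6] with PyTorch hooks: channel-wise gradient averages are used to weight the last convolutional feature map, producing a coarse class-attention heatmap. The model, layer choice and input are illustrative placeholders, and pretrained weights are omitted, so the heatmap here would not be meaningful without loading them.

```python
# Minimal Grad-CAM sketch: gradient-weighted activations of the last conv block.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18().eval()  # load pretrained weights in practice
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]

layer = model.layer4[-1]  # last conv block of ResNet-18
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)    # stand-in for a preprocessed image
logits = model(image)
logits[0, logits.argmax()].backward()  # gradient of the top-class score

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)    # channel importance
cam = F.relu((weights * activations["feat"]).sum(dim=1))      # weighted feature maps
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized attention map
```

Libraries such as tf-explain or shap wrap this kind of logic behind a one-line call, but the underlying idea, "which pixels did the gradients care about", is the same.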
Explainable AI techniques will make it possible to confirm existing knowledge, challenge it, and also generate new hypotheses.
FINAL THOUGHTS
There is a lot going on in computer vision and it's easy to lose track of every research trend. In 2022 we saw that very large vision models, especially those based on vision transformers, have been dominant. These models rely on self-supervised learning for training, so we can assume that self-supervision is here to stay. Furthermore, we can observe that much of the latest research lies at the intersection of computer vision and NLP. Multimodality is still in its infancy, yet it has already produced extraordinary results. It is hard to doubt that it will prevail and that we will see different deep learning fields unite. All of the above emphasize the need for more transparent models. Since AI has already penetrated deep into society, Explainable AI is more important than ever.
REFERENCES
[1] Learning Transferable Visual Models From Natural Language Supervision
[2] FLAVA: A Foundational Language And Vision Alignment Model
[3] Self-supervised learning: The dark matter of intelligence, https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/
[4] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[5] RISE: Randomized Input Sampling for Explanation of Black-box Models
[6] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization