Today, data is ubiquitous and growing exponentially. Yet it remains challenging for computers to understand and handle visual data the way humans do. In the past, computer vision techniques relied on brittle edge detection methods, color profiles, and a slew of manually coded processes that also required high-quality image annotations. Even so, they did not give computers an effective way to understand the semantics of the data. Advances in machine learning gave computers an opportunity to leverage big data and execute computer vision tasks efficiently.
Machine learning (ML) embedding has become an essential technique for handling various forms and types of data effectively and efficiently in ML tasks such as computer vision, natural language processing (NLP), speech recognition, recommender systems, and fraud detection. This article discusses the fundamentals of applying the embedding technique to computer vision by breaking down the concept of image embedding using convolutional neural networks (CNNs).
What is Image Embedding
In a nutshell, embedding is a dimensionality reduction technique. An embedding is a lower-dimensional vector representation of high-dimensional input data, such as words or images. Technically, the concept entails creating dense clusters of similarities or relationships within the data, which serve as semantic features encoded in a vector space. These encoded features act as identifying vectors for a particular data class. The lower-dimensional vectors make it possible to manage the data features that stand out efficiently, enabling the machine learning model to understand a data class better.
Generally, image embedding algorithms extract distinct features in an image and represent them with dense vectors (i.e., numerical identifiers) in a different dimensional space. The generated dense vectors can then be compared against the vectors of other images to measure similarity. Think of it as representing only the most distinct features of a 3-D object in 2-D and comparing how well the features appear in 2-D.
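To make the comparison step concrete, here is a minimal sketch that measures similarity between embedding vectors with cosine similarity. Cosine similarity is one common choice and is an assumption here, as are the tiny 4-dimensional vectors and their names; real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for illustration only.
cat_a = [0.9, 0.1, 0.8, 0.2]
cat_b = [0.8, 0.2, 0.9, 0.1]
truck = [0.1, 0.9, 0.2, 0.8]

sim_cats = cosine_similarity(cat_a, cat_b)       # two similar images
sim_cat_truck = cosine_similarity(cat_a, truck)  # two dissimilar images
```

With these made-up vectors, the two cat embeddings score much closer to 1.0 than the cat and truck pair, which is the behavior a similarity search relies on.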
Methods for generating image embeddings have evolved and become more advanced with the rise of deep learning (DL). Techniques range from the classical Bag of Visual Words (BoVW), which predates the DL era, to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Deep learning techniques use models that learn how to generate embeddings from images as part of training, rather than relying on features that are manually extracted from images in a separate pre-processing step. This has enabled the development of computer vision solutions for many datasets and use cases.
CNN Image Embeddings
At the time of this writing, CNNs are the de facto standard in the computer vision (CV) field, with many practical and production use cases. However, they are computationally expensive and require a lot of data.
A CNN is a deep neural network architecture containing two sets of blocks: convolutional and classification blocks. These blocks are the phases involved in generating image embeddings with a CNN. Each block plays a specific role in extracting embeddings so that the computer can understand the images, as we will see below. Although there are different CNN architectures, such as LeNet-5, AlexNet, VGGNet, and ResNet, the fundamental process for extracting embeddings is the same.
The convolutional block is responsible for extracting features from images and mapping the extracted features to image embeddings. As the name suggests, this block consists of convolutional layers. Each layer contains one or more filters and an activation function; in between, the block can also use other optional layers, which commonly include pooling and normalization layers. These layers provide additional benefits, such as regularization and improved training dynamics.
The convolutional layers can extract abstract features and ideas within an image and encode them as embeddings. The block consists of several convolutional layers stacked on top of one another. The early layers recognize simple features, like edges, shapes, and textures. As the network gets deeper, it can capture more abstract and distinctive traits, which the model eventually uses to identify a particular object in an image.
An image is made up of pixels, each with one or more color channels. The computer sees the pixels and color channels as arrays of numbers (one matrix per channel), where each value ranges from 0 (no color) to 255 (maximum color) in a typical 8-bit image. Together, these values make up the edges, shapes, and textures of the different features in an image.
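As a rough illustration of that representation, the sketch below builds a tiny RGB image as nested Python lists and reads off its shape. The image itself is made up, and the scaling to the 0.0 to 1.0 range is a common pre-processing step assumed here, not something this article prescribes.

```python
# A hypothetical 2x3 RGB image: rows -> pixels -> [R, G, B], each value in 0..255.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])


def to_unit_range(img):
    """Scale 8-bit pixel values from 0..255 to 0.0..1.0."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in img]


scaled = to_unit_range(image)
```

In practice a numerical library or deep learning framework would hold the same data as a single array of shape (height, width, channels) or (channels, height, width); plain lists are used here only to keep the sketch dependency-free.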
The filter in a convolutional layer reduces the image matrix to a lower-dimensional representation, in effect compressing the image. A filter is a small matrix (its shape is the window size) whose values are randomly initialized and then adjusted during training. At each position, the filter is multiplied element-wise with the pixel values in its window, and the products are summed into a single value (a scalar product) that represents that portion of the image. As the filter slides across each image window, it generates a complete feature map (i.e., a lower-dimensional matrix embedding) of the image.
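The sliding-window step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: a single-channel image, no padding, and a hand-picked filter rather than learned weights. The function name `convolve2d` is hypothetical. Like most deep learning frameworks, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding); each window yields one value."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply the window by the kernel element-wise and sum the products.
            total = sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 grayscale image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
# Hand-picked 2x2 filter that responds to dark-to-bright transitions, left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
# feature_map == [[0, 510, 0], [0, 510, 0], [0, 510, 0]]
```

The 4x4 input becomes a 3x3 feature map, and the only non-zero column is the one where the window straddles the edge: the filter has kept the feature it is sensitive to and discarded the flat regions.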
This process suppresses noise in the image and produces a smaller, smoother copy that maps the most prominent image features detected by the convolutional layer. These mappings are the extracted embeddings, and they contain abstract qualities of the image. In other words, a CNN generates embeddings by compressing the image. Think of it as converting a video from 1080p to 360p: although the picture is blurrier, you can still identify objects within a frame because of their distinct shapes, colors, etc.
Since the extracted image embeddings in the lower dimension are smoother copies of the input image, it is essential to be mindful of excessive image compression to avoid losing vital feature information in the embedding. There are a couple of ways to control the amount of image compression.
Modifying the filter size and stride (i.e., the number of pixels the filter moves at each step) is one way of controlling compression. Increasing the stride causes the filter to traverse the entire image in fewer steps, yielding fewer values and a more compressed feature map, and vice versa.
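The effect of filter size and stride on the size of the feature map follows the standard output-size formula, floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, s the stride, and p the padding, all along one axis. The helper below is a small sketch of that formula; the function name and the 28-pixel input are assumptions chosen for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of filter positions along one axis: floor((n + 2p - k) / s) + 1."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1


# A 28-pixel-wide input with a 3-pixel-wide filter at different strides.
sizes = {stride: conv_output_size(28, 3, stride=stride) for stride in (1, 2, 3)}
# sizes == {1: 26, 2: 13, 3: 9}

# A larger filter also shrinks the output, even at stride 1.
larger_filter = conv_output_size(28, 5)  # 24
```

Doubling the stride roughly halves the output along each axis, so the number of values in a 2-D feature map drops by about a factor of four.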
Using padding can also limit compression. Padding adds zero-valued pixels around the edge of the image matrix; as a result, the filter has more pixels and image windows to aggregate. Padding is typically applied before the convolution and is a particularly effective remedy for smaller images.
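A minimal sketch of zero padding, assuming a single-channel image stored as nested lists; the function name `zero_pad` is hypothetical. It also counts how many positions a 3x3 filter with stride 1 can occupy before and after padding.

```python
def zero_pad(image, pad=1):
    """Return a copy of a 2-D image with `pad` rings of zeros around it."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, pad=1)

# Window positions for a 3x3 filter with stride 1, before and after padding.
positions_before = (len(image) - 3 + 1) ** 2   # 1
positions_after = (len(padded) - 3 + 1) ** 2   # 9
```

Without padding, the 3x3 image collapses to a single value; with one ring of zeros, the feature map keeps the original 3x3 size, which is why padding matters most for small inputs.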
The pooling layer makes embedding extraction more robust, providing stability when identifying the compressed information in an extracted embedding. It downsamples the embeddings to reduce the size of the feature maps by taking the maximum or average value of a group of neighboring pixels. It is helpful in cases where the pixels of the embedded features shift slightly out of place due to compression, and identifying the object becomes difficult because of the slight deviation in shapes, edges, etc.
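The sketch below shows max and average pooling over non-overlapping 2x2 windows, and how max pooling can absorb a small shift. The function name `pool2d` and the feature maps are made up for illustration, and the shift tolerance only holds while the activation stays inside the same pooling window.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Downsample by taking the max or mean of non-overlapping size x size windows."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + di][left + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map with one strong activation (9).
original = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same activation shifted one pixel up and to the left.
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

max_original = pool2d(original)             # [[9, 0], [0, 0]]
max_shifted = pool2d(shifted)               # [[9, 0], [0, 0]]
avg_original = pool2d(original, mode="avg") # [[2.25, 0.0], [0.0, 0.0]]
```

Both inputs pool to the same 2x2 map, so the layers that follow see an identical embedding despite the one-pixel shift.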
Before the embedding is passed to the next layer within the convolutional block, the activation function in each convolutional layer introduces non-linearity into the model, which allows the layer to learn complex relationships between the image and the extracted embeddings. The rectified linear unit (ReLU), exponential linear unit (ELU), sigmoid, and tanh functions are some of the most common activation functions used in a convolutional block.
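For reference, the four activation functions named above can be written in plain Python as follows. This is a sketch for single values; the helper `apply_activation` and the sample feature map are assumptions for illustration, and `alpha=1.0` is a commonly used default for ELU rather than a value taken from this article.

```python
import math


def relu(x):
    """Keep positive values, replace negatives with zero."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Like ReLU for positives; smoothly approaches -alpha for large negatives."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sigmoid(x):
    """Squash any real number into (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Squash any real number into (-1, 1)."""
    return math.tanh(x)


def apply_activation(feature_map, fn):
    """Apply an activation function to every value of a 2-D feature map."""
    return [[fn(value) for value in row] for row in feature_map]


feature_map = [[-2.0, 0.0], [0.5, 3.0]]
activated = apply_activation(feature_map, relu)
# activated == [[0.0, 0.0], [0.5, 3.0]]
```

Without such a function between them, stacked convolutional layers would collapse into a single linear operation, no matter how many layers were used.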
The depth of the convolutional block is a critical factor in how well it extracts useful embeddings. At every successive layer within this block, the embedding captures a more abstract understanding of the distinguishing features of objects in the initial image. For example, take an image of an iPhone 14 Pro Max. With the embeddings extracted in the first few convolutional layers, the model recognizes that there is a mobile phone in the image; by the intermediate layers, it has extracted more embeddings and can tell it's an iPhone; and with more embeddings by the last layer, it is able to identify it as an iPhone 14 Pro Max.
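A toy sketch of stacking can show the mechanics behind this, though not the learned abstraction itself. It assumes a single-channel input, no padding, stride 1, and a hand-picked averaging filter standing in for learned weights; the function names are hypothetical. Each extra layer shrinks the feature map while every remaining value summarizes a larger region of the input.

```python
def conv2d(image, kernel):
    """Slide a square kernel over the image with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    return [[max(0.0, value) for value in row] for row in feature_map]


def flatten(feature_map):
    return [value for row in feature_map for value in row]


def embed(image, kernels):
    """Apply convolution + ReLU once per kernel, then flatten into one vector."""
    x = image
    for kernel in kernels:
        x = relu(conv2d(x, kernel))
    return flatten(x)


# Hypothetical 6x6 grayscale input with a bright 2x2 square in the middle.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Hand-picked 3x3 averaging filter; a trained network would learn its own weights.
blur = [[1 / 9] * 3 for _ in range(3)]

one_layer = embed(image, [blur])         # 6x6 -> 4x4 -> 16 values
two_layers = embed(image, [blur, blur])  # 6x6 -> 4x4 -> 2x2 -> 4 values
```

After two 3x3 layers, each of the four output values depends on a 5x5 region of the input, so deeper layers are in a position to respond to larger structures than the first layer can see.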
The classification block is the fully connected linear layer of the CNN. It typically comes after the convolutional block. It takes the embeddings from the convolutional layers and calculates the probability that a feature embedding belongs to an object class.
This layer reduces each embedding vector to scalar values, one per object class. The resulting data points give a more precise numeric representation of the abstract features, and embeddings of similar objects fall into clusters, which makes identifying an object's class easier. The clusters represent different features of an object class.
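A minimal sketch of this classification step, assuming a softmax over the outputs of one fully connected layer. Softmax is a common way to turn scores into probabilities but is an assumption here, and the embedding, weights, and class names are all made up for illustration; in a real network the weights are learned.

```python
import math


def linear(embedding, weights, biases):
    """Fully connected layer: one weighted sum of the embedding per class."""
    return [
        sum(w * x for w, x in zip(class_weights, embedding)) + bias
        for class_weights, bias in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical 3-dimensional embedding and weights for two classes.
embedding = [0.9, 0.1, 0.4]
class_names = ["phone", "laptop"]
weights = [
    [2.0, -1.0, 0.5],  # weights for "phone"
    [-1.0, 2.0, 0.0],  # weights for "laptop"
]
biases = [0.0, 0.0]

scores = linear(embedding, weights, biases)  # one scalar per class
probabilities = softmax(scores)
predicted = class_names[probabilities.index(max(probabilities))]
```

With these made-up numbers, the "phone" score is the larger of the two, so the embedding is assigned to that class with the higher probability.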
Image embeddings have revolutionized the field of computer vision by providing a compact and meaningful representation of images. With their ability to capture rich visual information, image embeddings have opened doors to numerous applications and paved the way for advancements in image analysis, interpretation, and generation.
However, the techniques for generating image embeddings are associated with typical challenges of sensitivity to image variations, computational complexity, and the need for large datasets for training. Nonetheless, ongoing research and innovation continue to address these limitations and improve the effectiveness and efficiency of image embedding techniques. As research and development in this field progress, image embeddings will undoubtedly continue to shape the future of computer vision.