December 19, 2024
Multimodal learning seeks to enable computers to represent real-world objects and concepts using multiple data streams. This post provides an overview of its applications and of the state-of-the-art techniques used to train and evaluate multimodal models.
A modality is a form in which information is represented or stored. In deep learning, it refers to the type of data a model processes: images, text, audio, video, and more. By combining multiple modalities, multimodal learning builds a more comprehensive understanding of a given object, concept, or task. This approach leverages the strengths of each data type, producing more accurate and robust predictions or classifications.
For example, in computer vision, a multimodal model can combine image and text data to perform image captioning or visual question answering. By processing visual and textual information, the model can provide more accurate and detailed image descriptions.
Multimodal learning models have a wide range of applications, such as:

- Image captioning and visual question answering
- Text-to-image generation (e.g., Stable Diffusion)
- Zero-shot image classification and object detection
- Optical character recognition (OCR)
The main idea behind multimodal models is to create consistent representations of a given concept across different modalities. There are several ways to build these representations, but the most popular approach involves creating encoders for each modality and using an objective function that encourages the models to produce similar embeddings for similar data pairs.
One popular approach for pulling together the representations of matching data pairs while pushing apart the representations of mismatched pairs is contrastive learning. It works by defining a similarity metric, typically cosine similarity or a dot product, that measures how close two representations are in the model's latent space. The model is then trained to maximize the similarity between matching pairs and minimize it between mismatched pairs.
In this approach, the encoders for each modality are independent networks; for a matching pair, they produce embeddings that are close in the shared latent space but not exactly the same.
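To make this concrete, below is a minimal PyTorch sketch of the symmetric contrastive loss used by CLIP-style models. The function name and the temperature value are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product between embeddings equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal: image i belongs with text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls matching pairs together and pushes
    # every mismatched pair in the batch apart.
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2
```

Each image in a batch forms a positive pair with its own caption and a negative pair with every other caption in the batch, so larger batches supply more negatives and a stronger training signal.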
CLIP (Contrastive Language-Image Pre-Training) is a state-of-the-art multimodal deep learning model trained using contrastive learning. This approach leverages massive datasets of 400 million image and text pairs to train the model on a vast range of visual and textual concepts, enabling it to learn the relationships between images and text.
CLIP addresses some major problems with the standard deep learning approach to computer vision, such as:

- Costly datasets: conventional vision models depend on large, manually labeled datasets that are expensive to build, whereas CLIP learns from image-text pairs that already exist on the internet.
- Narrow task scope: a classifier trained on a fixed set of labels must be fine-tuned to handle new categories, while CLIP can classify against arbitrary labels expressed as natural language.
- Poor real-world robustness: models that score well on a benchmark often degrade under distribution shift, and CLIP's zero-shot setup gives a more realistic picture of how well it generalizes.
CLIP has enabled numerous advancements in multimodal learning. DeepMind used the image encoder from its own CLIP-style contrastive model as the vision backbone of Flamingo, a visual language model trained on sequences of interleaved images and text. Stability AI's Stable Diffusion uses CLIP's text encoder to condition its image generator on a text description.
Furthermore, CLIP facilitates the development of various types of zero-shot models for computer vision tasks, such as image classification and object detection.
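As an illustration, here is a short sketch of zero-shot image classification with a pretrained CLIP model through the Hugging Face transformers library; the image path and candidate labels are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder: any local image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: round(p, 3) for label, p in zip(labels, probs[0].tolist())})
```

Because the labels are plain text, swapping in a new set of classes requires no retraining, which is what makes the model zero-shot.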
Multimodal models are judged on the quality of their representations. For models like CLIP, zero-shot performance on tasks such as image classification serves as a proxy for overall quality. For example, CLIP achieves a top-1 accuracy of 56% and a top-5 accuracy of 83% on the ImageNet dataset.
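For reference, top-k accuracy over a batch of predictions can be computed with a small helper like this sketch (the names and shapes are illustrative):

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    # logits: (N, num_classes) scores, e.g. image-text similarities from CLIP.
    # targets: (N,) ground-truth class indices.
    topk = logits.topk(k, dim=-1).indices               # (N, k) best classes per example
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)  # True where the label appears
    return hits.float().mean().item()
```

Top-1 accuracy is simply the k=1 case.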
Models like Flamingo are also evaluated by comparing their zero-shot and few-shot performance on multimodal tasks, such as visual question answering, image classification, OCR, and image captioning, with that of task-specific models trained on significantly more data.
Multimodal learning is a rapidly growing field of AI with the potential to transform how computers perceive and interact with the world. With diverse applications and the potential to reshape numerous industries, it offers many opportunities for researchers, engineers, and enthusiasts, and there has never been a better time to get involved.