
3 min read

CLIP & Multimodal Prompting

A paper review summary written as part of my master’s coursework in CSE597: Vision and Language

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

PDF | Paper Code

Summary

This paper presents a multimodal prompting-based approach that improves video classification by adapting a pre-trained image-text CLIP model to video. The CLIP backbone is kept frozen to preserve its supervised and zero-shot capabilities. To adapt the model, it uses frame-level prompts to capture per-frame information, a summary prompt to condense information across the video clip, video-level prompts to model the data distribution, and learnable textual prompts to enrich the class descriptions. In the supervised setting, the approach outperforms state-of-the-art methods on the Something-Something-V2 (SSv2) dataset, while zero-shot experiments are conducted on the Kinetics-600 (K600), HMDB51, and UCF101 datasets. These results, together with the unified training scheme and the ablation studies, make this a promising approach for video-classification tasks, given the scarcity of video-language training datasets and the computational cost of training on them.
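To make the prompting scheme more concrete, the sketch below shows roughly how learnable video-level prompts and a summary token could be attached to frozen per-frame CLIP embeddings, with only the prompt parameters trained. This is a simplified illustration rather than the authors' released code: names such as VideoPromptHead and num_video_prompts are my own, the frame features here stand in for outputs of the frozen CLIP image encoder, and the actual method additionally injects frame-level and textual prompts inside the frozen encoders.

```python
# Minimal sketch (assumed names, not the authors' implementation) of
# prompt-based video adaptation on top of a frozen CLIP image encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoPromptHead(nn.Module):
    """Learns video-level prompts and a summary token over frozen
    per-frame CLIP embeddings; only these parameters are trained."""

    def __init__(self, embed_dim: int = 512, num_video_prompts: int = 8,
                 num_heads: int = 8):
        super().__init__()
        # Video-level prompts: shared learnable tokens intended to model
        # the dataset-level distribution.
        self.video_prompts = nn.Parameter(
            torch.randn(num_video_prompts, embed_dim) * 0.02)
        # Summary token: condenses information across the whole clip.
        self.summary_token = nn.Parameter(torch.randn(1, embed_dim) * 0.02)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame embeddings from the frozen CLIP
        # image encoder (computed under torch.no_grad() upstream).
        B = frame_feats.size(0)
        prompts = self.video_prompts.unsqueeze(0).expand(B, -1, -1)
        summary = self.summary_token.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([summary, prompts, frame_feats], dim=1)
        tokens = self.encoder(tokens)
        # Use the summary token's output as the video representation.
        return F.normalize(tokens[:, 0], dim=-1)


if __name__ == "__main__":
    B, T, D, C = 2, 8, 512, 400           # batch, frames, embed dim, classes
    frame_feats = torch.randn(B, T, D)     # stand-in for frozen CLIP frame features
    text_feats = F.normalize(torch.randn(C, D), dim=-1)  # stand-in for class-text embeddings

    head = VideoPromptHead(embed_dim=D)
    video_feats = head(frame_feats)                   # (B, D)
    logits = 100.0 * video_feats @ text_feats.t()     # cosine-similarity logits
    print(logits.shape)                               # torch.Size([2, 400])
```

Classification then reduces to comparing the pooled video representation against the text embeddings of the class prompts, which is what allows the same head to serve both the supervised and zero-shot settings.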

Strengths

One of the key strengths of this approach is its unified training scheme for the supervised and zero-shot settings, which reduces computational requirements. Existing methods such as X-CLIP (Microsoft) require separate training schemes for the two settings with only a minimal change in accuracy: supervised training uses more epochs (30) and fewer frames (8), while zero-shot training uses fewer epochs (10) and more frames (32), along with a large number of trainable parameters. Vita-CLIP, in contrast, uses a single scheme (30 epochs, 8 frames per clip) with far fewer trainable parameters, yet still outperforms the accuracy achieved by X-CLIP. Additionally, the paper provides useful insights through ablation studies on the K400 dataset that examine the different kinds of prompts (global/local and frame-level prompts). Lastly, the literature review is comprehensive in laying out the associated challenges and the motivation for this approach.

Possible directions for future work

In the supervised setting, performance on the SSv2 dataset is lower than on K400 when compared against cross-entropy-trained methods; the authors modestly attribute this to the fine-grained nature of the SSv2 classes. It would also be helpful to have a few examples clarifying the distinction between local and global prompts, an ambiguity that has likewise been raised as an issue in the publicly available GitHub code repository. Furthermore, Vita-CLIP relies on static prompts; an alternative approach that learns motion cues in videos and uses motion-aware prompts captures richer semantic information and achieves slightly higher accuracy on the same datasets, as described by Wang et al. in Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning. Along the same lines, the textual prompts draw on an open, generalized vocabulary, which suggests that enriching the semantic information they carry could further improve the model’s accuracy.

Conclusion

The contributions of this paper are significant: it presents a clean approach to adapting the image-text CLIP model to video recognition and related downstream tasks. The multimodal prompting approach models temporal information and the video data distribution while also learning semantic information through textual prompts. With minimal fine-tuning and architectural modifications, it achieves competitive performance in both zero-shot and supervised settings. The weaknesses noted above are merely suggested improvements and do not diminish the significant contribution this work makes toward reducing the computational cost of training models for vision-and-language tasks.