Model Zoo

Here is the list of models currently implemented in MMF:

  • M4C Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [arXiv] [project]
  • ViLBERT ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [arXiv] [project]
  • VisualBERT VisualBERT: A Simple and Performant Baseline for Vision and Language [arXiv] [project]
  • LoRRA Towards VQA Models That Can Read [arXiv] [project]
  • M4C Captioner TextCaps: A Dataset for Image Captioning with Reading Comprehension [arXiv] [project]
  • Pythia Pythia v0.1: The Winning Entry to the VQA Challenge 2018 [arXiv] [project]
  • BUTD Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [arXiv] [project]
  • MMBT Supervised Multimodal Bitransformers for Classifying Images and Text [arXiv] [project]
  • MoViE Revisiting Modulated Convolutions for Visual Counting and Beyond [arXiv] [project]
  • BAN Bilinear Attention Networks [arXiv] [project]

In addition to the above, MMF also includes implementations of ConcatBERT, ConcatBOW, LateFusion, Unimodal Text, Unimodal Image, VisDial, MMFBERT, and other models. Many more models are being added and will be available soon.

Last updated by Vedanuj Goswami