This repository contains the code for the PyTorch implementation of the ViLT model, originally released under this (repo). Please cite the following paper if you are using the ViLT model from MMF:
- Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In 38th International Conference on Machine Learning (ICML). (arXiv)
Follow installation instructions in the documentation.
To train the ViLT model on the VQA2.0 dataset, run the following command:
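The command itself is not included in this section; a typical MMF training invocation for a project looks like the sketch below. The config path `projects/vilt/configs/vqa2/defaults.yaml` is an assumption based on MMF's usual project layout — check the actual path in your checkout.

```shell
# Sketch of an MMF training run for ViLT on VQA2.0.
# The config path below is assumed; verify it against projects/vilt/ in the repo.
mmf_run config=projects/vilt/configs/vqa2/defaults.yaml \
    model=vilt \
    dataset=vqa2 \
    run_type=train_val
```

MMF's CLI accepts dotlist-style overrides after these arguments, so individual config values can also be overridden directly on the command line.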
To finetune from different pretrained starting weights, change `pretrained_model_name` under `image_encoder` in the config YAML to reference a Hugging Face model.
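As a sketch, the relevant portion of the config might look like the fragment below. The nesting follows MMF's usual `model_config` convention, and the specific model identifier is only an example of a Hugging Face ViT checkpoint — both are assumptions, not taken from this README.

```yaml
# Hypothetical config fragment; key nesting and the model name are examples.
model_config:
  vilt:
    image_encoder:
      params:
        # Swap this for any compatible Hugging Face vision model
        pretrained_model_name: google/vit-base-patch32-384
```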