# ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
This repository contains the code for the PyTorch implementation of the ViLT model, originally released in this [repo](https://github.com/dandelin/ViLT). Please cite the following paper if you are using the ViLT model from mmf:
- Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. *ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision*. In 38th International Conference on Machine Learning (ICML). ([arXiv](https://arxiv.org/abs/2102.03334))
## Installation

Follow installation instructions in the documentation.
## Training

To train the ViLT model on the VQA2.0 dataset, run the following command:
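A sketch of the invocation, assuming mmf's standard `mmf_run` CLI and its usual `projects/<model>/configs/<dataset>/defaults.yaml` layout; the exact config path may differ in this repository:

```bash
# Train and validate ViLT on VQA2.0 with mmf's CLI runner.
# The config path follows mmf's usual projects/<model>/configs/<dataset>/ layout
# and is an assumption; adjust it to the config shipped with this repository.
mmf_run config=projects/vilt/configs/vqa2/defaults.yaml \
    model=vilt \
    dataset=vqa2 \
    run_type=train_val
```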
To fine-tune from different pretrained starting weights, change the `pretrained_model_name` under `image_encoder` in the config YAML to reference a Hugging Face model, as sketched below.
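For illustration, a minimal sketch of the relevant config section; the exact nesting and the default checkpoint name shown here are assumptions and may differ in the actual config:

```yaml
# Hypothetical excerpt of the ViLT config YAML; nesting and defaults are assumed.
model_config:
  vilt:
    image_encoder:
      type: vit
      params:
        # Point this at any compatible Hugging Face ViT checkpoint,
        # e.g. google/vit-base-patch16-224 (illustrative).
        pretrained_model_name: google/vit-base-patch32-384
```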