In this quickstart guide, we are going to train the M4C model on the TextVQA dataset. TextVQA requires models to read and reason about text in images to answer questions about them.
M4C is a recent state-of-the-art (SOTA) model for TextVQA that combines a multimodal transformer architecture with a rich representation of text in images.
To train other models or understand more about MMF, follow Next Steps at the bottom of this tutorial.
Install MMF following the installation documentation.
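As a sketch, MMF is typically installed from source (the clone URL below matches MMF's public GitHub repository; defer to the installation documentation for the authoritative steps and supported Python/PyTorch versions):

```shell
# Clone MMF and install it in editable mode, per the installation docs
git clone https://github.com/facebookresearch/mmf.git
cd mmf
pip install --editable .
```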
Datasets and required files will be downloaded automatically when we run training. For more details about custom datasets and other advanced dataset setups, check the dataset documentation.
We can start training by running the following command:
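A typical invocation, assuming the mmf_run entry point that ships with MMF (verify the exact flags against your installed version), looks like:

```shell
# Train M4C on TextVQA; datasets and features are downloaded automatically
mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
    model=m4c \
    datasets=textvqa \
    run_type=train_val
```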
The hyperparameters for training and for the experiment are defined in the experiment config
projects/m4c/configs/textvqa/defaults.yaml. We can also override config parameters from the command line:
training.batch_size=32 sets the batch size to 32, and
training.max_updates=44000 sets the maximum number of training iterations to 44000.
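For example, these overrides can be appended directly to the training command (a sketch; mmf_run accepts dotlist-style config overrides):

```shell
mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
    model=m4c \
    datasets=textvqa \
    run_type=train_val \
    training.batch_size=32 \
    training.max_updates=44000
```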
Similarly, log interval, checkpoint interval and validation interval can all be set as:
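A sketch of such a command, assuming MMF's training.log_interval, training.checkpoint_interval, and training.evaluation_interval config keys:

```shell
mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
    model=m4c \
    datasets=textvqa \
    run_type=train_val \
    training.log_interval=10 \
    training.checkpoint_interval=100 \
    training.evaluation_interval=1000
```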
This will log training progress every 10 iterations, checkpoint the model every 100 iterations, and run validation every 1000 iterations. More about configurations and how we set them can be found here.
For running inference or generating predictions, we can specify a pretrained model using its zoo key and then run the following command:
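A sketch using MMF's mmf_predict entry point; the zoo key below (m4c.textvqa.defaults) is an assumed example, so substitute the key of whichever pretrained model you want:

```shell
# Generate predictions on the test set with a pretrained model from the zoo
mmf_predict config=projects/m4c/configs/textvqa/defaults.yaml \
    model=m4c \
    datasets=textvqa \
    run_type=test \
    checkpoint.resume_zoo=m4c.textvqa.defaults
```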
For running inference on the val set, use run_type=val; the rest of the arguments stay the same.
checkpoint.resume_zoo loads a pretrained model from the model zoo. To learn more about checkpoints and pretraining/finetuning models, check this tutorial.
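As a related sketch, MMF also provides a checkpoint.resume_file option for resuming from a local checkpoint instead of a zoo model (the path below is hypothetical):

```shell
# Evaluate a locally saved checkpoint on the val set
# (./save/m4c_final.pth is a placeholder path, not a real artifact)
mmf_run config=projects/m4c/configs/textvqa/defaults.yaml \
    model=m4c \
    datasets=textvqa \
    run_type=val \
    checkpoint.resume_file=./save/m4c_final.pth
```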
These commands should be enough to get you started with training and performing inference using MMF.
If you use MMF in your work or use any models published in MMF, please cite:
To dive deeper into MMF, explore the following topics next: