# Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
This project page shows how to use the M4C model from the following paper, released as part of MMF:
- R. Hu, A. Singh, T. Darrell, M. Rohrbach, *Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA*. In CVPR, 2020.
Project Page: http://ronghanghu.com/m4c
## Installation

Install MMF following the installation guide. This will install all M4C dependencies, such as `transformers` and `editdistance`, and will also compile the Python interface for PHOC features.
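
A minimal sketch of the setup, assuming the standard from-source install described in the MMF installation guide (the clone location and editable install are the usual defaults there; check the guide for current instructions):

```bash
git clone https://github.com/facebookresearch/mmf.git
cd mmf
pip install --editable .
```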
## Notes about data

This repo supports training and evaluation of the M4C model on three datasets: TextVQA, ST-VQA, and OCR-VQA. When you run a command, these datasets and their requirements are downloaded automatically.
For the ST-VQA dataset, we noticed that many images from COCO-Text in the downloaded ST-VQA data (around 1/3 of all images) are resized to 256×256 for unknown reasons, which degrades image quality and distorts aspect ratios. In the released object and OCR features below, we replaced these images with their original versions from COCO-Text as inputs to the object detection and OCR systems.
The released imdbs contain OCR results and normalized bounding boxes (i.e., in the range [0, 1]) for each detected object (under the `obj_normalized_boxes` key) and each OCR token (under the `ocr_normalized_boxes` key). Note that the answers in the ST-VQA and OCR-VQA imdbs are tiled (duplicated) to 10 answers per question to keep their format consistent with the TextVQA imdbs.
For the TextVQA dataset, the downloaded file contains both imdbs with Rosetta-en OCRs (better performance) and imdbs with Rosetta-ml OCRs (the same OCR results as in the previous LoRRA model). Please download the corresponding OCR feature files.
Note that the object Faster R-CNN features are extracted with `extract_features_vmb.py`, and the OCR Faster R-CNN features are extracted with `extract_ocr_frcn_feature.py`.
## Pretrained M4C Models

We release the following pretrained models for M4C on three datasets: TextVQA, ST-VQA, and OCR-VQA.

For the TextVQA dataset, we release three versions: M4C trained with ST-VQA as additional data (our best model) with Rosetta-en OCRs, M4C trained on TextVQA alone with Rosetta-en OCRs, and M4C trained on TextVQA alone with Rosetta-ml OCRs (the same OCR results as in the previous LoRRA model).
| Dataset | Config File (under `projects/m4c/configs`) | Pretrained Model Key | Metrics | Notes |
| --- | --- | --- | --- | --- |
| TextVQA (`textvqa`) | `textvqa/join_with_stvqa.yaml` | `m4c.textvqa.with_stvqa` | val accuracy 40.55%; test accuracy 40.46% | Rosetta-en OCRs; ST-VQA as additional data |
| TextVQA (`textvqa`) | `textvqa/defaults.yaml` | `m4c.textvqa.alone` | val accuracy 39.40%; test accuracy 39.01% | Rosetta-en OCRs |
| TextVQA (`textvqa`) | `textvqa/ocr_ml.yaml` | `m4c.textvqa.ocr_ml` | val accuracy 37.06% | Rosetta-ml OCRs |
| ST-VQA (`stvqa`) | `stvqa/defaults.yaml` | `m4c.stvqa.defaults` | val ANLS 0.472 (accuracy 38.05%); test ANLS 0.462 | Rosetta-en OCRs |
| OCR-VQA (`ocrvqa`) | `ocrvqa/defaults.yaml` | `m4c.ocrvqa.defaults` | val accuracy 63.52%; test accuracy 63.87% | Rosetta-en OCRs |
## Training and Evaluation

Please follow the MMF documentation for training and evaluating the M4C model on each dataset.
For example:
- To train the M4C model on the TextVQA training set, run a command like the sketch below. (Replace `textvqa` with another dataset and `projects/m4c/configs/textvqa/defaults.yaml` with another config file to train on other datasets and configurations; see the table above. You can also pass a different path as `env.save_dir` to save to a location you prefer.)
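
  A minimal training sketch, assuming MMF's standard `mmf_run` entry point; the save path `./save/m4c` is an arbitrary example:

  ```bash
  mmf_run datasets=textvqa \
      model=m4c \
      config=projects/m4c/configs/textvqa/defaults.yaml \
      env.save_dir=./save/m4c
  ```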
- To evaluate the pretrained M4C model locally on the TextVQA validation set (assuming the pretrained model you are evaluating is `m4c.textvqa.with_stvqa`), run a command like the sketch below. As with training, you can replace `datasets`, `config`, and `checkpoint.resume_zoo` according to the setting you want to evaluate.
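
  A minimal evaluation sketch under those assumptions, using `mmf_run` with `run_type=val`:

  ```bash
  mmf_run datasets=textvqa \
      model=m4c \
      config=projects/m4c/configs/textvqa/join_with_stvqa.yaml \
      env.save_dir=./save/m4c \
      run_type=val \
      checkpoint.resume_zoo=m4c.textvqa.with_stvqa
  ```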

  **Note:** Use `checkpoint.resume=True` AND `checkpoint.resume_best=True` instead of `checkpoint.resume_zoo=m4c.textvqa.with_stvqa` to evaluate your own trained snapshots.

  **Tip:** Follow the checkpointing tutorial to understand the finer-grained details of checkpointing, loading, and resuming in MMF.
- To generate the EvalAI prediction files for the TextVQA test set (assuming you are evaluating the pretrained model `m4c.textvqa.with_stvqa`), run a command like the sketch below. As before, to generate predictions for other pretrained TextVQA models, replace `config` and `checkpoint.resume_zoo` according to the setting you want in the table.
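
  A sketch of the prediction command, assuming MMF's `mmf_predict` entry point with the same flags as above:

  ```bash
  mmf_predict datasets=textvqa \
      model=m4c \
      config=projects/m4c/configs/textvqa/join_with_stvqa.yaml \
      env.save_dir=./save/m4c \
      run_type=test \
      checkpoint.resume_zoo=m4c.textvqa.with_stvqa
  ```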

  **Note:** To generate predictions on the val set, use `run_type=val` instead of `run_type=test`. As before, to generate predictions from your own checkpoint, use `checkpoint.resume=True` AND `checkpoint.resume_best=True` instead of `checkpoint.resume_zoo=m4c.textvqa.with_stvqa`.