This project page shows how to use the M4C model from the following paper, released as part of MMF:
- R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. In CVPR, 2020 (PDF)
Project Page: http://ronghanghu.com/m4c
Install MMF following the installation guide. This will install all M4C dependencies, such as editdistance, and will also compile the Python interface for PHOC features.
Notes about data
This repo supports training and evaluation of the M4C model on three datasets: TextVQA, ST-VQA, and OCR-VQA. When you run a command, these datasets and their requirements are downloaded automatically.
For the ST-VQA dataset, we noticed that many images from COCO-Text in the downloaded ST-VQA data (around 1/3 of all images) are resized to 256×256 for unknown reasons, which degrades image quality and distorts aspect ratios. In the released object and OCR features below, we replaced these images with their original versions from COCO-Text as inputs to the object detection and OCR systems.
The released imdbs contain OCR results and normalized bounding boxes (i.e. in the range of [0,1]) of each detected object (under the obj_normalized_boxes key) and each OCR token (under the ocr_normalized_boxes key). Note that the answers in the ST-VQA and OCR-VQA imdbs are tiled (duplicated) to 10 answers per question to make their format consistent with the TextVQA imdbs.
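To make the imdb conventions above concrete, here is a minimal Python sketch of how normalized boxes can be mapped back to pixel coordinates and how a short answer list can be tiled to 10 entries. Only the obj_normalized_boxes and ocr_normalized_boxes keys come from the description above. The [x_min, y_min, x_max, y_max] box layout, the "answers" key, the cyclic tiling scheme, and the helper names are assumptions made for illustration, so check them against the released files.

```python
from typing import List, Sequence


def denormalize_boxes(
    boxes: Sequence[Sequence[float]], image_width: int, image_height: int
) -> List[List[float]]:
    """Scale boxes from the [0, 1] range to pixel units.

    Assumes each box is [x_min, y_min, x_max, y_max] (an assumption,
    not confirmed by the text above).
    """
    return [
        [x0 * image_width, y0 * image_height, x1 * image_width, y1 * image_height]
        for x0, y0, x1, y1 in boxes
    ]


def tile_answers(answers: Sequence[str], target: int = 10) -> List[str]:
    """Repeat the answers cyclically until there are exactly `target` of them."""
    if not answers:
        raise ValueError("need at least one answer to tile")
    return [answers[i % len(answers)] for i in range(target)]


# Hypothetical imdb-style entry; the "answers" key is illustrative only.
entry = {
    "obj_normalized_boxes": [[0.0, 0.0, 0.5, 0.5]],
    "ocr_normalized_boxes": [[0.25, 0.5, 0.75, 1.0]],
    "answers": ["stop"],
}

obj_px = denormalize_boxes(entry["obj_normalized_boxes"], 640, 480)
ocr_px = denormalize_boxes(entry["ocr_normalized_boxes"], 640, 480)
tiled = tile_answers(entry["answers"])
```

Keeping boxes normalized makes them independent of image resolution, which matters here because some ST-VQA images were replaced with differently sized originals.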
For the TextVQA dataset, the downloaded file contains both imdbs with the Rosetta-en OCRs (better performance) and imdbs with Rosetta-ml OCRs (same OCR results as in the previous LoRRA model). Please download the corresponding OCR feature files.
Pretrained M4C Models
We release the following pretrained models for M4C on three datasets: TextVQA, ST-VQA, and OCR-VQA.
For the TextVQA dataset, we release three versions: M4C trained with ST-VQA as additional data (our best model) with Rosetta-en, M4C trained on TextVQA alone with Rosetta-en, and M4C trained on TextVQA alone with Rosetta-ml (same OCR results as in the previous LoRRA model).
|Datasets||Config Files (under ||Pretrained Model Key||Metrics||Notes|
|TextVQA (||val accuracy - 40.55%; test accuracy - 40.46%||Rosetta-en OCRs; ST-VQA as additional data|
|TextVQA (||val accuracy - 39.40%; test accuracy - 39.01%||Rosetta-en OCRs|
|TextVQA (||m4c.textvqa.ocr_ml||val accuracy - 37.06%||Rosetta-ml OCRs|
|ST-VQA (||val ANLS - 0.472 (accuracy - 38.05%); test ANLS - 0.462||Rosetta-en OCRs|
|OCR-VQA (||val accuracy - 63.52%; test accuracy - 63.87%||Rosetta-en OCRs|
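The ST-VQA row reports ANLS (Average Normalized Levenshtein Similarity) rather than plain accuracy. As a rough illustration of that metric, the sketch below scores each prediction by its best normalized edit-distance similarity against the ground-truth answers and zeroes out matches whose normalized distance reaches 0.5. The lowercasing and whitespace stripping are assumptions, the edit distance is written in plain Python to keep the block self-contained, and MMF's own implementation may differ in such details.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, gt_answers: Sequence[str], threshold: float = 0.5) -> float:
    """Best similarity over all ground-truth answers for one question."""
    pred = prediction.strip().lower()  # normalization is an assumption
    best = 0.0
    for gt in gt_answers:
        gt_norm = gt.strip().lower()
        longest = max(len(pred), len(gt_norm))
        distance = levenshtein(pred, gt_norm) / longest if longest else 0.0
        # Similarities from distances at or above the threshold count as 0.
        best = max(best, 1.0 - distance if distance < threshold else 0.0)
    return best


def anls(predictions: Sequence[str], gt_answers_per_question: Sequence[Sequence[str]]) -> float:
    """Average the per-question scores over the whole set."""
    scores: List[float] = [
        anls_score(p, g) for p, g in zip(predictions, gt_answers_per_question)
    ]
    return sum(scores) / len(scores)
```

Unlike exact-match accuracy, this gives partial credit for near misses such as a single wrong OCR character, which is why the ST-VQA row lists both numbers.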
Training and Evaluation
Please follow the MMF documentation for the training and evaluation of the M4C model on each dataset.
- To train the M4C model on the TextVQA training set:
(Replace textvqa with other datasets and projects/m4c/configs/textvqa/defaults.yaml with other config files to train with other datasets and configurations; see the table above. You can also set env.save_dir to a different path to save to a location you prefer.)
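As an illustration of how these pieces fit together, the following Python sketch assembles a training command from key=value overrides. The textvqa dataset name, the config path, and env.save_dir come from the text above. The mmf_run entry point and the datasets, model, and config option names are assumptions about MMF's command line, and ./save/m4c is a placeholder path, so confirm them against the MMF documentation before relying on this.

```python
import shlex
from typing import Dict, List


def build_command(entry_point: str, overrides: Dict[str, str]) -> List[str]:
    """Turn a mapping of options into an argument list of key=value overrides."""
    return [entry_point] + [f"{key}={value}" for key, value in overrides.items()]


train_cmd = build_command(
    "mmf_run",  # assumed MMF entry point
    {
        "datasets": "textvqa",  # swap for another dataset from the table
        "model": "m4c",  # assumed model name
        "config": "projects/m4c/configs/textvqa/defaults.yaml",
        "env.save_dir": "./save/m4c",  # placeholder output directory
    },
)

# Render the argument list as a single shell-safe command line.
command_line = shlex.join(train_cmd)
print(command_line)
```

Building the command as a list keeps each override separate, so it can be passed directly to subprocess.run without shell quoting issues.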
- To evaluate the pretrained M4C model locally on the TextVQA validation set (assuming that the pretrained model that you are evaluating is
As with training, you can replace the value of checkpoint.resume_zoo according to the setting you want to evaluate. Use checkpoint.resume_best=True instead of checkpoint.resume_zoo=m4c.textvqa.with_stvqa to evaluate your trained snapshots.
See the checkpointing tutorial for finer-grained details of checkpointing, loading, and resuming in MMF.
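The choice between a released model and your own snapshot comes down to a single override. The sketch below shows one way to pick it in Python. The checkpoint.resume_zoo, checkpoint.resume_best, and run_type options and the m4c.textvqa.with_stvqa key appear in this page, while the helper functions themselves are hypothetical and not part of MMF.

```python
from typing import List, Optional


def checkpoint_override(zoo_key: Optional[str] = None) -> str:
    """Pick the checkpoint option: a model zoo key, or your own best snapshot."""
    if zoo_key is not None:
        return f"checkpoint.resume_zoo={zoo_key}"
    return "checkpoint.resume_best=True"


def evaluation_overrides(run_type: str, zoo_key: Optional[str] = None) -> List[str]:
    """Combine the split to evaluate on with the checkpoint to load."""
    if run_type not in ("val", "test"):
        raise ValueError("run_type must be 'val' or 'test'")
    return [f"run_type={run_type}", checkpoint_override(zoo_key)]


# Evaluate the released model trained with ST-VQA as additional data.
zoo_eval = evaluation_overrides("val", "m4c.textvqa.with_stvqa")

# Evaluate a snapshot produced by your own training run instead.
own_eval = evaluation_overrides("val")
```

Treating the two checkpoint options as mutually exclusive mirrors the instruction to use one instead of the other.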
- To generate the EvalAI prediction files for the TextVQA test set (assuming you are evaluating the pretrained model
As before, to generate predictions for another pretrained TextVQA model, set checkpoint.resume_zoo according to the setting you want from the table.
To generate predictions on the val set, use run_type=val instead of run_type=test. As before, to generate predictions for your own checkpoint, use
checkpoint.resume_best=True instead of