TextCaps: a Dataset for Image Captioning with Reading Comprehension

This project page shows how to use the M4C-Captioner model from the following paper, released as part of MMF:

  • O. Sidorov, R. Hu, M. Rohrbach, A. Singh, TextCaps: a Dataset for Image Captioning with Reading Comprehension. In ECCV, 2020 (PDF)
title={TextCaps: a Dataset for Image Captioning with Reading Comprehension},
author={Sidorov, Oleksii and Hu, Ronghang and Rohrbach, Marcus and Singh, Amanpreet},
booktitle={European Conference on Computer Vision},

Project Page: https://textvqa.org/textcaps


Install MMF following the installation guide.

This installs all M4C dependencies, such as transformers and editdistance, and also compiles the Python interface for PHOC features.

In addition, you need to install pycocoevalcap:

# install pycocoevalcap
# use the repo below instead of https://github.com/tylin/coco-caption
# note: you also need to have java on your machine
pip install git+https://github.com/ronghanghu/[email protected]

Note that Java is required for pycocoevalcap.
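A quick pre-flight check can catch a missing prerequisite before a long evaluation run. The sketch below is not part of MMF: check_eval_prereqs is a hypothetical helper name, and it only tests whether a java executable is on the PATH and whether a pycocoevalcap package can be located.

```python
import importlib.util
import shutil


def check_eval_prereqs():
    """Report whether the caption-evaluation prerequisites are visible.

    Hypothetical helper for illustration; not an MMF or pycocoevalcap API.
    """
    return {
        # Java is required for pycocoevalcap, so look for it on the PATH.
        "java": shutil.which("java") is not None,
        # find_spec locates the package without importing it.
        "pycocoevalcap": importlib.util.find_spec("pycocoevalcap") is not None,
    }


if __name__ == "__main__":
    for name, found in check_eval_prereqs().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check says nothing about versions; it only confirms that both pieces can be found from the current Python environment.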

Pretrained M4C-Captioner Models

We release two variants of the M4C-Captioner model trained on the TextCaps dataset. The first (defaults) is trained with newer features extracted with maskrcnn-benchmark. The second (with_caffe2_feat) is trained with older features extracted with Caffe2; it is the variant used in the experiments in our paper and has a slightly higher CIDEr (89.6 vs. 89.1 on val). To reproduce the results from the paper exactly, use the with_caffe2_feat config and model zoo file.

| Config Files (under projects/m4c_captioner/configs/m4c_captioner/textcaps) | Pretrained Model Key | Metrics | Notes |
|---|---|---|---|
| defaults.yaml | m4c_captioner.textcaps.defaults | val CIDEr -- 89.1 (BLEU-4 -- 23.4) | newer features extracted with maskrcnn-benchmark |
| with_caffe2_feat.yaml | m4c_captioner.textcaps.with_caffe2_feat | val CIDEr -- 89.6 (BLEU-4 -- 23.3) | older features extracted with Caffe2; used in experiments in the paper |

Training and Evaluating M4C-Captioner

Please follow the MMF documentation for the training and evaluation of the M4C-Captioner models.

For example:

1) to train the M4C-Captioner model on the TextCaps training set:

mmf_run datasets=textcaps \
    model=m4c_captioner \
    config=projects/m4c_captioner/configs/m4c_captioner/textcaps/defaults.yaml \
    env.save_dir=./save/m4c_captioner/defaults

(Replace projects/m4c_captioner/configs/m4c_captioner/textcaps/defaults.yaml with other config files to train with other configurations. See the table above. You can also specify a different path to env.save_dir to save to a location you prefer.)
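When switching between the configurations in the table above, it can help to assemble the command programmatically. This is a minimal sketch, assuming mmf_run is on the PATH; build_train_command is a hypothetical helper rather than an MMF API, and it only reproduces the arguments shown in the command above.

```python
import shlex

CONFIG_DIR = "projects/m4c_captioner/configs/m4c_captioner/textcaps"


def build_train_command(variant="defaults", save_root="./save/m4c_captioner"):
    """Assemble the mmf_run argument list for one config variant.

    `variant` is the config file name without ".yaml",
    e.g. "defaults" or "with_caffe2_feat".
    Hypothetical helper for illustration; not part of MMF.
    """
    return [
        "mmf_run",
        "datasets=textcaps",
        "model=m4c_captioner",
        f"config={CONFIG_DIR}/{variant}.yaml",
        # Keep each variant's snapshots in its own directory.
        f"env.save_dir={save_root}/{variant}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_train_command("with_caffe2_feat")))
```

To launch training from Python, the returned list can be passed to subprocess.run(cmd, check=True).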

2) to generate prediction json files for TextCaps (assuming you are evaluating the pretrained model m4c_captioner.textcaps.defaults):

Generate prediction file on the validation set:

mmf_predict datasets=textcaps \
    model=m4c_captioner \
    config=projects/m4c_captioner/configs/m4c_captioner/textcaps/defaults.yaml \
    env.save_dir=./save/m4c_captioner/defaults \
    run_type=val \
    checkpoint.resume_zoo=m4c_captioner.textcaps.defaults

Generate prediction file on the test set:

mmf_predict datasets=textcaps \
    model=m4c_captioner \
    config=projects/m4c_captioner/configs/m4c_captioner/textcaps/defaults.yaml \
    env.save_dir=./save/m4c_captioner/defaults \
    run_type=test \
    checkpoint.resume_zoo=m4c_captioner.textcaps.defaults

As with training, you can replace config and checkpoint.resume_zoo according to the setting you want to evaluate.


To evaluate your own trained snapshots, use checkpoint.resume=True and checkpoint.resume_best=True instead of checkpoint.resume_zoo=m4c_captioner.textcaps.defaults.
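The two ways of selecting a checkpoint are used one in place of the other, which the sketch below makes explicit. checkpoint_overrides is a hypothetical helper written for illustration; the option names are the ones quoted above, and nothing else about MMF's checkpoint handling is assumed.

```python
def checkpoint_overrides(zoo_key=None):
    """Return the checkpoint-related command-line overrides for mmf_predict.

    Pass a model zoo key to evaluate a released model; pass nothing to
    evaluate the best snapshot from your own training run.
    Hypothetical helper for illustration; not part of MMF.
    """
    if zoo_key is not None:
        # Released model, e.g. "m4c_captioner.textcaps.defaults".
        return [f"checkpoint.resume_zoo={zoo_key}"]
    # Own trained snapshot: resume, and pick the best checkpoint.
    return ["checkpoint.resume=True", "checkpoint.resume_best=True"]


if __name__ == "__main__":
    print(checkpoint_overrides("m4c_captioner.textcaps.with_caffe2_feat"))
    print(checkpoint_overrides())
```

The returned strings are meant to be appended to the rest of the mmf_predict arguments.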


See the checkpointing tutorial for finer-grained details of checkpoint loading and resuming in MMF.

Afterwards, use projects/m4c_captioner/scripts/textcaps_eval.py to evaluate the prediction json file. For example:

# the default data location of MMF (unless you have specified it otherwise)
# this is where MMF datasets are stored
export MMF_DATA_DIR=~/.cache/torch/mmf/data
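# replace YOUR_VAL_PREDICTION_FILE with the prediction json file generated above
python projects/m4c_captioner/scripts/textcaps_eval.py \
    --set val \
    --annotation_file ${MMF_DATA_DIR}/datasets/textcaps/defaults/annotations/imdb_val.npy \
    --pred_file YOUR_VAL_PREDICTION_FILE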
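Before running the evaluation, it may be worth confirming that the annotation file is where the script will look for it. This is a small sketch, assuming the default data location shown above applies when MMF_DATA_DIR is unset. annotation_path is a hypothetical helper name, and only the imdb_val.npy file name appears in this page; other split names are assumed to follow the same pattern.

```python
import os
from pathlib import Path

# Default MMF data location, as given in the shell example above.
DEFAULT_MMF_DATA_DIR = "~/.cache/torch/mmf/data"


def annotation_path(split="val", env=None):
    """Build the path to the TextCaps annotation file for a split.

    Hypothetical helper for illustration; file names other than
    imdb_val.npy are an assumption.
    """
    env = os.environ if env is None else env
    data_dir = Path(env.get("MMF_DATA_DIR", DEFAULT_MMF_DATA_DIR)).expanduser()
    return data_dir / "datasets" / "textcaps" / "defaults" / "annotations" / f"imdb_{split}.npy"


if __name__ == "__main__":
    path = annotation_path("val")
    status = "exists" if path.exists() else "not found"
    print(f"{path} ({status})")
```

Passing a plain dict as env makes the helper easy to try without touching the real environment.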

For test set evaluation, please submit to the TextCaps EvalAI server. See https://textvqa.org/textcaps for details.

Last updated by Ronghang Hu