Adding a dataset
[Outdated] A new version of this will be uploaded soon
This is a tutorial on how to add a new dataset to MMF.
MMF is agnostic to the kind of datasets that can be added to it. At a high level, adding a dataset requires 4 main components.
- Dataset Builder
- Default Configuration
- Dataset Class
- Dataset's Metrics
In most cases, you should be able to inherit from one of the existing datasets for easy integration. Let's start with the dataset builder.
A builder creates and returns an instance of :class:`mmf.datasets.base_dataset.BaseDataset`, which inherits from ``torch.utils.data.dataset.Dataset``. Any builder class in MMF needs to inherit from :class:`mmf.datasets.base_dataset_builder.BaseDatasetBuilder`. |BaseDatasetBuilder| requires the user to implement the following methods after inheriting the class.
- ``__init__(self)``: Inside this function, call ``super().__init__("name")``, where "name" should be your dataset's name, like "vqa2".
- ``load(self, config, dataset_type, *args, **kwargs)``: This function loads the dataset, builds an object of a class inheriting |BaseDataset| that contains your dataset logic, and returns it.
- ``build(self, config, dataset_type, *args, **kwargs)``: This function builds the data required to initialize the dataset for the first time. For example, if you need to download some data for your dataset, do it all inside this function.
Finally, you need to register your dataset builder with a key in the registry.
That's all you need to inherit from |BaseDatasetBuilder|.
Let's write this down using the CLEVR dataset as an example.
.. code-block:: python
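As a minimal sketch of the builder steps described above, the following shows the general shape a CLEVR builder could take. To keep it runnable without MMF installed, ``registry``, ``BaseDatasetBuilder`` and ``CLEVRDataset`` are small local stand-ins written for this illustration, not MMF's implementations; in a real project the first two would come from ``mmf.common.registry`` and ``mmf.datasets.base_dataset_builder``. The ``data_dir`` config key and the directory creation in ``build`` are assumptions made for the example.

```python
import os
import tempfile


class _Registry:
    """Stand-in for MMF's registry: maps string keys to builder classes."""

    def __init__(self):
        self.builders = {}

    def register_builder(self, key):
        def wrap(cls):
            self.builders[key] = cls
            return cls

        return wrap


registry = _Registry()


class BaseDatasetBuilder:
    """Stand-in base class (hypothetical, not MMF's implementation)."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name


class CLEVRDataset:
    """Placeholder for the dataset class that holds the dataset logic."""

    def __init__(self, config, dataset_type):
        self.config = config
        self.dataset_type = dataset_type


@registry.register_builder("clevr")
class CLEVRBuilder(BaseDatasetBuilder):
    def __init__(self):
        # Pass the dataset's name to the parent class.
        super().__init__("clevr")

    def build(self, config, dataset_type, *args, **kwargs):
        # One-time preparation; a real builder would download and extract data here.
        os.makedirs(config["data_dir"], exist_ok=True)

    def load(self, config, dataset_type, *args, **kwargs):
        # Construct and return the dataset object for the requested split.
        self.dataset = CLEVRDataset(config, dataset_type)
        return self.dataset


# Look the builder up by its key, prepare the data, then load the "train" split.
builder = registry.builders["clevr"]()
config = {"data_dir": os.path.join(tempfile.mkdtemp(), "clevr")}
builder.build(config, "train")
dataset = builder.load(config, "train")
```

The stand-in registry only stores the class under its key, which is enough to show why the key passed to the decorator and the name passed to ``super().__init__`` are usually kept the same.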
Some things to note about MMF's configuration:
- Each dataset in MMF has its own default configuration, usually organized under a structure in which ``task`` is the task your dataset belongs to.
- The user can then include these dataset configurations in their end config.
- This allows easy multi-tasking and management of configurations, and the user can also easily override the default configurations in their own config.
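To illustrate what overriding a default means in practice, here is a small sketch that layers a user config over a dataset default with a recursive dictionary merge. This is not MMF's configuration loader; the ``merge`` helper and every key shown (``dataset_config``, ``clevr``, ``data_dir``, ``batch_size``) are hypothetical names chosen for the example.

```python
import copy


def merge(defaults, overrides):
    """Return a new dict in which values from `overrides` replace `defaults`.

    Nested dicts are merged key by key; any other value is replaced wholesale.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical dataset defaults and a user config that changes one value.
dataset_defaults = {
    "dataset_config": {"clevr": {"data_dir": "data", "batch_size": 32}}
}
user_config = {"dataset_config": {"clevr": {"batch_size": 64}}}

final_config = merge(dataset_defaults, user_config)
print(final_config["dataset_config"]["clevr"])
# prints {'data_dir': 'data', 'batch_size': 64}
```

Only the key the user sets is changed; everything else falls through from the dataset's defaults, and the defaults themselves are left unmodified.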
So, for the CLEVR dataset, we also need to create a default configuration.
The config node is passed directly to your builder, which can then pass it to your dataset for any configuration needed to build it.
The basic structure of a dataset configuration looks like this:
.. code-block:: yaml
Here is a default configuration for CLEVR, based on our dataset and builder class above:
.. code-block:: yaml
For processors, check :class:`mmf.datasets.processors` to understand how to create a processor and which processors are already available in MMF.
The next step is to actually build a dataset class that inherits |BaseDataset| so it can interact with PyTorch dataloaders. Follow the steps below to inherit and create your dataset's class.
- Inherit |BaseDataset|.
- Implement ``__init__(self, config, dataset)``. Call the parent's init using ``super().__init__("name", config, dataset)``, where "name" is the string representing the name of your dataset.
- Implement ``__getitem__(self, idx)``, which plays the role of the ``__getitem__(self, idx)`` you would normally implement for a torch dataset. This needs to return an object of class :class:`~mmf.common.sample.Sample`.
- Implement the ``__len__(self)`` method, which returns the size of your dataset.
- [Optional] Implement ``load_item(self, idx)`` if you need to load something or do something else with the data, and then call it inside ``__getitem__``.
.. code-block:: python
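The steps above can be sketched as follows. So that the example runs without MMF or PyTorch installed, ``BaseDataset`` and ``Sample`` are simplified local stand-ins (a plain class and a ``dict`` subclass), not MMF's implementations. The ``questions`` config key, the record fields, and the whitespace tokenization are assumptions made for illustration only.

```python
class Sample(dict):
    """Stand-in for MMF's Sample: a plain mapping of field names to values."""


class BaseDataset:
    """Stand-in base class (hypothetical, not MMF's implementation)."""

    def __init__(self, dataset_name, config, dataset_type):
        self.dataset_name = dataset_name
        self.config = config
        self.dataset_type = dataset_type


class CLEVRDataset(BaseDataset):
    def __init__(self, config, dataset):
        # Call the parent's init with the dataset's name.
        super().__init__("clevr", config, dataset)
        # Hypothetical in-memory records; a real dataset would read from disk.
        self.questions = config["questions"]

    def __len__(self):
        # Size of the dataset.
        return len(self.questions)

    def load_item(self, idx):
        # Optional helper: fetch the raw record before wrapping it in a Sample.
        return self.questions[idx]

    def __getitem__(self, idx):
        record = self.load_item(idx)
        sample = Sample()
        sample["text"] = record["question"].lower().split()
        sample["answer"] = record["answer"]
        return sample


config = {
    "questions": [
        {"question": "How many cubes are there", "answer": "2"},
        {"question": "What color is the sphere", "answer": "red"},
    ]
}
dataset = CLEVRDataset(config, "train")
first = dataset[0]
```

In a real dataset, the text and answer fields would typically go through the processors defined in the configuration rather than the inline ``split()`` used here.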
For your dataset to be compatible out of the box, it is good practice to also add the metrics your dataset requires. For now, all metrics go inside ``MMF/modules/metrics.py``. All metrics inherit |BaseMetric| and implement a function ``calculate`` with the signature ``calculate(self, sample_list, model_output, *args, **kwargs)``, where ``sample_list`` (|SampleList|) is the current batch and ``model_output`` is a dict returned by your model for the current ``sample_list``. Normally, you should define the keys you want inside ``sample_list``. Finally, you should register your metric with the registry using ``@registry.register_metric('[key]')``, where '[key]' is the key for your metric. Here is a sample implementation of the accuracy metric used in the CLEVR dataset:
.. code-block:: python
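A minimal sketch of such an accuracy metric is shown here. It is written in plain Python so it runs on its own: ``registry`` and ``BaseMetric`` are local stand-ins rather than MMF's classes, lists are used where a real metric would operate on tensors, and the ``scores`` and ``targets`` keys are assumed names for the model output and the ground-truth labels.

```python
class _Registry:
    """Stand-in for MMF's registry: maps string keys to metric classes."""

    def __init__(self):
        self.metrics = {}

    def register_metric(self, key):
        def wrap(cls):
            self.metrics[key] = cls
            return cls

        return wrap


registry = _Registry()


class BaseMetric:
    """Stand-in base class (hypothetical, not MMF's implementation)."""

    def __init__(self, name):
        self.name = name


@registry.register_metric("accuracy")
class Accuracy(BaseMetric):
    def __init__(self):
        super().__init__("accuracy")

    def calculate(self, sample_list, model_output, *args, **kwargs):
        # Predicted class is the index of the highest score in each row.
        predictions = [row.index(max(row)) for row in model_output["scores"]]
        targets = sample_list["targets"]
        correct = sum(int(p == t) for p, t in zip(predictions, targets))
        return correct / len(targets)


# A batch of four examples with three candidate classes each.
sample_list = {"targets": [1, 0, 2, 1]}
model_output = {
    "scores": [
        [0.1, 0.8, 0.1],
        [0.7, 0.2, 0.1],
        [0.3, 0.4, 0.3],
        [0.2, 0.5, 0.3],
    ]
}
accuracy = registry.metrics["accuracy"]().calculate(sample_list, model_output)
```

Three of the four predictions match their targets here, so ``accuracy`` evaluates to 0.75.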
These are the common steps you need to follow when you are adding a dataset to MMF.
.. |BaseDatasetBuilder| replace:: :class:`~mmf.datasets.base_dataset_builder.BaseDatasetBuilder`
.. |BaseDataset| replace:: :class:`~mmf.datasets.base_dataset.BaseDataset`
.. |SampleList| replace:: :class:`~mmf.common.sample.SampleList`
.. |BaseMetric| replace:: :class: