Adding a dataset
[Outdated] A new version of this will be uploaded soon
# MMF

This is a tutorial on how to add a new dataset to MMF.
MMF is agnostic to the kind of dataset that can be added to it. At a high level, adding a dataset requires four main components:
- Dataset Builder
- Default Configuration
- Dataset Class
- Dataset's Metrics
In most cases, you should be able to inherit from one of the existing datasets for easy integration. Let's start with the dataset builder.
# Dataset Builder

A builder creates and returns an instance of :class:`mmf.datasets.base_dataset.BaseDataset`, which inherits from ``torch.utils.data.dataset.Dataset``. Any builder class in MMF needs to inherit from :class:`mmf.datasets.base_dataset_builder.BaseDatasetBuilder`. |BaseDatasetBuilder| requires the user to implement the following methods after inheriting from the class.
- ``__init__(self)``: Inside this function, call ``super().__init__("name")``, where ``"name"`` should be your dataset's name, such as ``"vqa2"``.
- ``load(self, config, dataset_type, *args, **kwargs)``: This function loads the dataset, builds an object of a class inheriting |BaseDataset| that contains your dataset logic, and returns it.
- ``build(self, config, dataset_type, *args, **kwargs)``: This function builds the data required to initialize the dataset for the first time. For example, if you need to download some data for your dataset, all of that should be done inside this function.
Finally, you need to register your dataset builder with the registry under a key using ``mmf.common.registry.registry.register_builder("key")``.

That is all you need in order to inherit from |BaseDatasetBuilder|.

Let's write this down using the CLEVR dataset as an example.
.. code-block:: python
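The builder steps described above might be sketched roughly as follows. This is a minimal, self-contained illustration and not MMF's actual CLEVR builder: ``Registry``, ``BaseDatasetBuilder``, and ``ToyCLEVRDataset`` are toy stand-ins defined here so the snippet runs without MMF installed, and the ``data_dir`` config key is hypothetical.

```python
# Toy stand-ins so the sketch runs without MMF installed. In MMF these roles
# are played by mmf.common.registry.registry and BaseDatasetBuilder; the
# bodies below are assumptions made for illustration only.
class Registry:
    """Toy registry mapping a key to a builder class."""

    def __init__(self):
        self.builders = {}

    def register_builder(self, key):
        def wrap(cls):
            self.builders[key] = cls
            return cls

        return wrap


registry = Registry()


class BaseDatasetBuilder:
    """Toy base class: stores the dataset name passed by the subclass."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name


class ToyCLEVRDataset:
    """Hypothetical placeholder for a class inheriting BaseDataset."""

    def __init__(self, config, dataset_type):
        self.config = config
        self.dataset_type = dataset_type


@registry.register_builder("clevr")
class CLEVRBuilder(BaseDatasetBuilder):
    def __init__(self):
        # Pass the dataset's name to the parent class.
        super().__init__("clevr")
        self.was_built = False

    def build(self, config, dataset_type, *args, **kwargs):
        # One-time preparation (for example, downloading data) would go here.
        self.was_built = True

    def load(self, config, dataset_type, *args, **kwargs):
        # Create and return the object that holds the dataset logic.
        return ToyCLEVRDataset(config, dataset_type)


# Look the builder up by its key, prepare the data, then load the dataset.
builder = registry.builders["clevr"]()
builder.build({"data_dir": "data"}, "train")
dataset = builder.load({"data_dir": "data"}, "train")
print(builder.dataset_name, dataset.dataset_type)
```

Looking the builder up by its registered key, rather than importing the class directly, is what lets a configuration file refer to a dataset by name.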
# Default Configuration

Some things to note about MMF's configuration:

- Each dataset in MMF has its own default configuration, which is usually found under ``mmf/common/defaults/configs/datasets/[task]/[dataset].yaml``, where ``task`` is the task your dataset belongs to.
- These dataset configurations can then be included by users in their final config using the ``includes`` directive.
- This allows easy multi-tasking and management of configurations, and users can also easily override the default configurations in their own config.
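The idea of a user config overriding a dataset's defaults can be illustrated with plain dictionaries. The sketch below is not MMF's configuration loader; ``merge_configs`` is a hypothetical helper, and the keys shown (``dataset_config``, ``data_dir``, ``build_attributes``, ``min_count``) are illustrative assumptions rather than a documented schema.

```python
import copy


def merge_configs(default, override):
    """Return a new dict in which values from `override` replace `default`.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical default configuration shipped with a dataset.
default_config = {
    "dataset_config": {
        "clevr": {"data_dir": "data", "build_attributes": {"min_count": 1}}
    }
}

# Hypothetical user config that overrides a single nested value.
user_config = {
    "dataset_config": {"clevr": {"build_attributes": {"min_count": 5}}}
}

final_config = merge_configs(default_config, user_config)
print(final_config["dataset_config"]["clevr"])
```

Only the overridden value changes; every other default is carried over untouched, and the default dictionary itself is not modified.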
So, for the CLEVR dataset too, we need to create a default configuration.
The config node is passed directly to your builder, which can then pass it on to your dataset for any configuration needed to build it.
The basic structure of a dataset configuration looks like this:
.. code-block:: yaml
.. note:
Here is a default configuration for CLEVR, based on our dataset and builder class above:
.. code-block:: yaml
For processors, check :class:`mmf.datasets.processors` to understand how to create a processor and which processors are already available in MMF.
# Dataset Class

The next step is to build a dataset class that inherits |BaseDataset| so it can interact with PyTorch dataloaders. Follow the steps below to inherit and create your dataset's class.

- Inherit :class:`mmf.datasets.base_dataset.BaseDataset`.
- Implement ``__init__(self, config, dataset)``. Call the parent's init using ``super().__init__("name", config, dataset)``, where ``"name"`` is the string representing the name of your dataset.
- Implement ``__getitem__(self, idx)``, the same method you would implement for a torch dataset. This needs to return an object of class :class:`Sample`.
- Implement the ``__len__(self)`` method, which returns the size of your dataset.
- [Optional] Implement ``load_item(self, idx)`` if you need to load something or do something else with the data, and then call it inside ``__getitem__``.
.. note:
.. code-block:: python
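The steps listed above for the dataset class might look roughly like this in code. The sketch is self-contained and is not MMF's real CLEVR dataset: ``Sample`` and ``BaseDataset`` are simplified stand-ins defined here so the snippet runs without MMF or PyTorch, and the in-memory ``questions`` list and the ``text`` and ``answer`` fields are hypothetical.

```python
# Stand-ins so the sketch runs without MMF or PyTorch installed.
# The real Sample and BaseDataset live in MMF; these bodies are assumptions.
class Sample(dict):
    """Toy sample: a dict whose keys can also be used as attributes."""

    def __setattr__(self, key, value):
        self[key] = value

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


class BaseDataset:
    """Toy base class that records the name, config, and dataset type."""

    def __init__(self, name, config, dataset):
        self.name = name
        self.config = config
        self.dataset_type = dataset


class ToyCLEVRDataset(BaseDataset):
    def __init__(self, config, dataset):
        # "clevr" is the string naming this dataset.
        super().__init__("clevr", config, dataset)
        # Hypothetical in-memory data; a real dataset would read from disk.
        self.questions = [
            {"question": "How many cubes are there?", "answer": "2"},
            {"question": "What color is the sphere?", "answer": "red"},
        ]

    def __len__(self):
        # Size of the dataset.
        return len(self.questions)

    def load_item(self, idx):
        # Optional helper: fetch the raw record for one index.
        return self.questions[idx]

    def __getitem__(self, idx):
        record = self.load_item(idx)
        sample = Sample()
        # Lowercase the question, drop the trailing "?", split on whitespace.
        sample.text = record["question"].lower().rstrip("?").split()
        sample.answer = record["answer"]
        return sample


dataset = ToyCLEVRDataset(config={}, dataset="train")
print(len(dataset), dataset[1].text, dataset[1].answer)
```

In a real dataset the tokenization done inline here would typically be delegated to a processor configured for the dataset.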
# Metrics

For your dataset to be compatible out of the box, it is good practice to also add the metrics your dataset requires. All metrics for now go inside ``MMF/modules/metrics.py``. All metrics inherit |BaseMetric| and implement a function ``calculate`` with the signature ``calculate(self, sample_list, model_output, *args, **kwargs)``, where ``sample_list`` (|SampleList|) is the current batch and ``model_output`` is a dict returned by your model for the current ``sample_list``. Normally, you should define the keys you want inside ``model_output`` and ``sample_list``. Finally, you should register your metric with the registry using ``@registry.register_metric('[key]')``, where ``'[key]'`` is the key for your metric. Here is a sample implementation of the accuracy metric used in the CLEVR dataset:
.. code-block:: python
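As a rough illustration of the metric interface described above, the following sketch registers a toy accuracy metric. It is not the implementation from MMF's metrics module: ``Registry`` and ``BaseMetric`` are stand-ins defined here, plain Python lists are used in place of tensors, and the ``scores`` and ``targets`` keys are hypothetical.

```python
# Toy stand-ins so the sketch runs without MMF installed; the real registry
# and BaseMetric are provided by MMF and may behave differently.
class Registry:
    """Toy registry mapping a key to a metric class."""

    def __init__(self):
        self.metrics = {}

    def register_metric(self, key):
        def wrap(cls):
            self.metrics[key] = cls
            return cls

        return wrap


registry = Registry()


class BaseMetric:
    """Toy base class: stores the metric's name."""

    def __init__(self, name):
        self.name = name


@registry.register_metric("toy_accuracy")
class ToyAccuracy(BaseMetric):
    def __init__(self):
        super().__init__("toy_accuracy")

    def calculate(self, sample_list, model_output, *args, **kwargs):
        # Hypothetical keys: "scores" holds per-class scores for each sample,
        # "targets" holds the index of the correct class for each sample.
        scores = model_output["scores"]
        targets = sample_list["targets"]
        # Predicted class is the index of the highest score in each row.
        predictions = [row.index(max(row)) for row in scores]
        correct = sum(int(p == t) for p, t in zip(predictions, targets))
        return correct / len(targets)


sample_list = {"targets": [0, 2, 1, 1]}
model_output = {
    "scores": [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.5],
        [0.6, 0.3, 0.1],
        [0.1, 0.8, 0.1],
    ]
}
metric = registry.metrics["toy_accuracy"]()
accuracy = metric.calculate(sample_list, model_output)
print(accuracy)  # 3 of 4 predictions match the targets
```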
These are the common steps you need to follow when you are adding a dataset to MMF.
.. |BaseDatasetBuilder| replace:: :class:`~mmf.datasets.base_dataset_builder.BaseDatasetBuilder`
.. |BaseDataset| replace:: :class:`~mmf.datasets.base_dataset.BaseDataset`
.. |SampleList| replace:: :class:`~mmf.common.sample.SampleList`
.. |BaseMetric| replace:: :class:`~mmf.modules.metrics.BaseMetric`