Defining Dataset
The Project Config section describes how a dataset must be defined with a Custom Object in the config. This section details how to define the class_name, args and kwargs of that custom object.
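As a rough illustration, the three fields can be pictured as a small mapping. The sketch below is written as a Python dict and serialized to JSON. The dataset key, the module path my_package.loaders.load_your_dataset and the max_rows keyword are placeholders chosen for this example, not values taken from Azimuth.

```python
import json

# Hypothetical custom object: every value below is a placeholder.
dataset_custom_object = {
    # Dotted path to the loading function.
    "class_name": "my_package.loaders.load_your_dataset",
    # Positional arguments for the loading function (none here).
    "args": [],
    # Keyword arguments, received by the loader through **kwargs.
    "kwargs": {"max_rows": 1000},
}

# Assumption: the custom object sits under a "dataset" key in the config.
config_fragment = {"dataset": dataset_custom_object}
print(json.dumps(config_fragment, indent=2))
```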
Dataset Definition
Azimuth supports the HuggingFace Dataset API. The loading function for the dataset must respect the following contract:
from datasets import DatasetDict

from azimuth.config import AzimuthConfig

def load_your_dataset(azimuth_config: AzimuthConfig, **kwargs) -> DatasetDict:
    ...
You don't have a HuggingFace Dataset?
If your dataset is not a HuggingFace Dataset, you can convert it easily using resources from HuggingFace.
We suggest following this HuggingFace tutorial to learn more about loading datasets with HuggingFace.
Dataset splits
Azimuth looks for a train split and one of the validation or test splits. If both validation and test are available, Azimuth uses validation. The train split is optional: Azimuth can run without it.
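The selection rule above can be sketched in plain Python. This is an illustration of the described behaviour only, not Azimuth's actual code: pick_splits is a hypothetical helper, and raising a ValueError when neither evaluation split exists is an assumption.

```python
from typing import Iterable, Optional, Tuple


def pick_splits(available: Iterable[str]) -> Tuple[Optional[str], str]:
    """Return (train split or None, evaluation split) from the split names."""
    names = set(available)
    # The train split is optional.
    train = "train" if "train" in names else None
    # Prefer validation over test when both are present.
    for candidate in ("validation", "test"):
        if candidate in names:
            return train, candidate
    # Assumption: a missing evaluation split is treated as an error.
    raise ValueError("Expected a 'validation' or 'test' split.")


print(pick_splits(["train", "validation", "test"]))
print(pick_splits(["test"]))
```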
Column names and rejection class
Go to the Project Config to see other attributes that should be set along with the dataset.
Example
Using this API, we can load SST2, a sentiment analysis dataset.
Note: in this case, we can omit azimuth_config from the definition because we don't need it.