Defining Dataset
The Project Config page describes how a dataset needs to be defined with a Custom Object in the config. This section details how to define the class_name, args and kwargs of that custom object.
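To make the three fields concrete, here is a minimal sketch of how a custom object with class_name, args and kwargs could be turned into a function call. The helper `call_custom_object` is hypothetical and is not Azimuth's actual loading code (which, per the contract in the next section, also passes the Azimuth config to the function). The `posixpath.join` target is used only because it ships with the standard library.

```python
import importlib
from typing import Any, Dict


def call_custom_object(custom_object: Dict[str, Any]) -> Any:
    """Import the callable named by class_name and call it with args/kwargs.

    Illustrative only: this is an assumption about how such a config
    could be resolved, not Azimuth's implementation.
    """
    # Split "package.module.function" into the module path and attribute name.
    module_path, _, attr_name = custom_object["class_name"].rpartition(".")
    module = importlib.import_module(module_path)
    func = getattr(module, attr_name)
    # args are passed positionally, kwargs by keyword.
    return func(*custom_object.get("args", []), **custom_object.get("kwargs", {}))


example = {
    "class_name": "posixpath.join",
    "args": ["data", "train.csv"],
    "kwargs": {},
}
result = call_custom_object(example)
```

In a real config, class_name would point at your own loading function, and kwargs would carry its parameters (for instance, file paths).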
Dataset Definition
Azimuth supports the HuggingFace Dataset API. The loading function for the dataset must respect the following contract:
from datasets import DatasetDict

from azimuth.config import AzimuthConfig


def load_your_dataset(azimuth_config: AzimuthConfig, **kwargs) -> DatasetDict:
    ...
You don't have a HuggingFace Dataset?
If your dataset is not a HuggingFace Dataset, you can convert it easily using resources from HuggingFace. We suggest following the HuggingFace tutorial on dataset loading to learn more.
Dataset splits
Azimuth expects at least one of the train, validation or test splits to be available.
- If both validation and test are available, we will pick the former as the evaluation split.
- The app can load a train split only, an evaluation split only, or both.
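The selection rule above can be sketched as a small function. `pick_evaluation_split` is a hypothetical name used for illustration and is not part of Azimuth's API; it only mirrors the rule as stated in this section.

```python
from typing import Iterable, Optional


def pick_evaluation_split(available_splits: Iterable[str]) -> Optional[str]:
    """Return the split used for evaluation, preferring validation over test."""
    splits = set(available_splits)
    if "validation" in splits:
        return "validation"
    if "test" in splits:
        return "test"
    # Only a train split (or nothing) is available: no evaluation split.
    return None
```

For example, a dataset with train, validation and test splits is evaluated on validation, while a dataset with only train and test is evaluated on test.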
Column names and rejection class
Go to the Project Config to see other attributes that should be set along with the dataset.
Examples
Using this API, we can load SST2, a sentiment analysis dataset.
We can also load a CSV file.
from datasets import DatasetDict, load_dataset


def load_csv(train_path=None, validation_path=None) -> DatasetDict:
    # Map each split name to its CSV file, skipping splits with no path.
    data_files = dict()
    if train_path:
        data_files["train"] = train_path
    if validation_path:
        data_files["validation"] = validation_path

    # The "csv" builder reads the files and returns one split per key.
    ds_dict = load_dataset(path="csv", data_files=data_files)
    return ds_dict
Note: in both cases, we can omit azimuth_config from the definition because we don't need it.
For more examples, users can refer to azimuth_shr/loading_resources.py in the repo.