Defining Dataset

The Project Config section describes how a dataset is defined with a Custom Object in the config. This section details how to set the class_name, args and kwargs of that custom object.

Dataset Definition

Azimuth supports the HuggingFace Dataset API. The loading function for the dataset must respect the following contract:

from datasets import DatasetDict

from azimuth.config import AzimuthConfig


def load_your_dataset(azimuth_config: AzimuthConfig, **kwargs) -> DatasetDict:
    ...

You don't have a HuggingFace Dataset?

If your dataset is not a HuggingFace Dataset, you can convert it easily using the following resources from HuggingFace:

  1. from local files
  2. from in-memory data

We suggest following this HuggingFace tutorial to learn more about loading datasets with HuggingFace.

Dataset splits

Azimuth expects a train split and one of the validation or test splits. If both validation and test are available, Azimuth picks validation. The train split is optional: Azimuth can still run without it.
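
The selection rule above can be sketched in plain Python. pick_eval_split is a hypothetical helper, not part of Azimuth; it only mirrors the stated preference for validation over test, and returning None when neither split exists is an assumption made for the sketch.

```python
from typing import Iterable, Optional


def pick_eval_split(available_splits: Iterable[str]) -> Optional[str]:
    """Return the evaluation split name, preferring validation over test."""
    splits = set(available_splits)
    # Order matters: validation is checked first, so it wins when both exist.
    for candidate in ("validation", "test"):
        if candidate in splits:
            return candidate
    # Assumption for this sketch: signal "no evaluation split" with None.
    return None
```

For example, pick_eval_split(["train", "validation", "test"]) selects "validation", while pick_eval_split(["test"]) selects "test".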

Column names and rejection class

Go to the Project Config to see other attributes that should be set along with the dataset.

Example

Using this API, we can load SST2, a sentiment analysis dataset.

Note: in this case, we can omit azimuth_config from the definition because we don't need it.

from datasets import DatasetDict, load_dataset


def load_sst2_dataset(dataset_name: str) -> DatasetDict:
    datasets = load_dataset("glue", dataset_name)
    return DatasetDict(
        {"train": datasets["train"], "validation": datasets["validation"]}
    )

The corresponding config:

{
  "dataset": {
    "class_name": "loading_resources.load_sst2_dataset",
    "remote": "/azimuth_shr",
    "kwargs": {
      "dataset_name": "sst2"
    }
  },
  "columns": {
    "text_input": "sentence"
  },
  "rejection_class": null
}
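
Conceptually, class_name names a callable and kwargs supplies its keyword arguments, so the config above amounts to calling load_sst2_dataset(dataset_name="sst2"). The sketch below illustrates that idea with a hypothetical call_custom_object helper. It is not Azimuth's implementation, and it ignores remote and the azimuth_config argument.

```python
import importlib
from typing import Any, Dict


def call_custom_object(spec: Dict[str, Any]) -> Any:
    # Split "package.module.function" into a module path and an attribute name.
    module_path, _, attr_name = spec["class_name"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    # Positional and keyword arguments default to empty when absent.
    return target(*spec.get("args", []), **spec.get("kwargs", {}))


# Demonstrated with a standard-library function so the sketch runs anywhere.
result = call_custom_object(
    {
        "class_name": "json.dumps",
        "args": [{"b": 1, "a": 2}],
        "kwargs": {"sort_keys": True},
    }
)
```

Here result holds the JSON string produced by json.dumps with sorted keys, which shows that args and kwargs reach the callable unchanged.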