
GPTHuggingfaceDatasetConfig

Module: fast_llm.data.preparation.gpt_memmap.config

Fields

path (core)

Type: str or Path    Default: None

Name or path of the dataset.

config_name (optional)

Type: str or None    Default: None

Specific configuration name for the dataset.

data_directory (optional)

Type: str or None    Default: None

The data_dir argument passed to load_dataset.

data_files (optional)

Type: str or list[str] or None    Default: None

The data_files argument passed to load_dataset.

data_type (optional)

Type: DataType or None    Default: None

Data type of the dataset field. If not provided, it will be inferred based on the tokenizer vocabulary size.
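
The inference rule itself is not spelled out here. As a rough sketch of the idea only, assuming the goal is the narrowest integer type that can hold every token ID, the hypothetical helper below (infer_token_dtype is not part of Fast-LLM) maps a vocabulary size to a type name. The actual thresholds and type names used by the library may differ.

```python
def infer_token_dtype(vocab_size: int) -> str:
    """Return the name of the narrowest unsigned integer type that fits every token ID.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    max_token_id = vocab_size - 1  # token IDs are assumed to run from 0 to vocab_size - 1
    for bits in (8, 16, 32, 64):
        if max_token_id < 2**bits:
            return f"uint{bits}"
    raise ValueError("vocab_size is too large for a 64-bit integer")


print(infer_token_dtype(32_000))   # uint16
print(infer_token_dtype(128_256))  # uint32
```

Setting data_type explicitly skips this kind of inference, which can help when the stored width must match an existing prepared dataset.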

disable_disk_space_check (optional)

Type: bool    Default: False

Disable disk space check. Useful for environments where disk space is not accurately reported.
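
To illustrate what such a flag typically guards, here is a minimal sketch using the standard library's shutil.disk_usage. The function name and the required_bytes parameter are assumptions made for this example, not the library's own check.

```python
import shutil


def has_enough_disk_space(directory: str, required_bytes: int, disable_check: bool = False) -> bool:
    """Return True if `directory` reports at least `required_bytes` of free space.

    Hypothetical helper: when `disable_check` is True, the reported free space
    is ignored, which mirrors the intent of disable_disk_space_check.
    """
    if disable_check:
        return True
    # shutil.disk_usage returns a named tuple with total, used and free bytes.
    return shutil.disk_usage(directory).free >= required_bytes
```

On network or overlay file systems the free-space figure can be misleading, which is the situation where bypassing the check is useful.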

load_from_disk (feature)

Type: bool    Default: False

Use the load_from_disk method for datasets saved with save_to_disk.

source_schema (optional)

Type: LanguageModelSourceConfig    Default: (sub-fields optional)

Configuration for the data source.

split (optional)

Type: str    Default: "train"

Split of the dataset to use.

trust_remote_code (optional)

Type: bool    Default: False

Trust remote code when downloading the dataset.
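
The fields above line up with arguments of the Hugging Face datasets loaders. The sketch below shows one plausible mapping without importing datasets. The helper build_loader_call is hypothetical, and the assumption that config_name is passed as the name argument of load_dataset is not confirmed by this page.

```python
from typing import Any, Optional, Union


def build_loader_call(
    path: str,
    config_name: Optional[str] = None,
    data_directory: Optional[str] = None,
    data_files: Union[str, list, None] = None,
    split: str = "train",
    trust_remote_code: bool = False,
    load_from_disk: bool = False,
) -> tuple:
    """Return the loader name and keyword arguments implied by the config fields.

    Hypothetical helper for illustration; the real preparation code may differ.
    """
    if load_from_disk:
        # Datasets written with save_to_disk are reopened from a local directory.
        return "load_from_disk", {"dataset_path": path}
    kwargs: dict = {
        "path": path,
        "name": config_name,  # assumed mapping: config_name -> name
        "data_dir": data_directory,
        "data_files": data_files,
        "split": split,
        "trust_remote_code": trust_remote_code,
    }
    return "load_dataset", kwargs


loader, kwargs = build_loader_call("my_org/my_dataset", config_name="default")
print(loader, kwargs["split"])  # load_dataset train
```

With the datasets package installed, the returned pair could be dispatched as getattr(datasets, loader)(**kwargs). The dataset name used here is a placeholder.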

Used in