GPTHuggingfaceDatasetConfig

Module: fast_llm.data.preparation.gpt_memmap.config

Fields

path — core

Type: str or Path Default: None

Name or path of the dataset.

config_name — optional

Type: str or None Default: None

Specific configuration name for the dataset.

data_directory — optional

Type: str or None Default: None

Value passed to load_dataset as its data_dir argument.

data_files — optional

Type: str or list[str] or None Default: None

Value passed to load_dataset as its data_files argument.

data_type — optional

Type: DataType or None Default: None

Data type of the dataset field. If not provided, it is inferred from the tokenizer vocabulary size.
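As a rough illustration of inferring a data type from the vocabulary size, the sketch below picks the narrowest unsigned integer type that can hold every token id. The helper name `infer_token_dtype` and the list of candidate types are assumptions made for this example; the actual selection logic in Fast-LLM may differ.

```python
def infer_token_dtype(vocab_size: int) -> str:
    """Return the name of the narrowest unsigned integer type that fits all token ids.

    Hypothetical helper: token ids are assumed to range from 0 to vocab_size - 1.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32), ("uint64", 64)):
        # The largest id is vocab_size - 1, so it fits when vocab_size <= 2**bits.
        if vocab_size <= 2**bits:
            return name
    raise ValueError(f"vocab_size {vocab_size} is too large for a 64-bit type")
```

Under this sketch, a vocabulary of 50257 entries would map to a 16-bit type, while one of 128256 entries would need a 32-bit type. Setting data_type explicitly skips any such inference.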

disable_disk_space_check — optional

Type: bool Default: False

Disable disk space check. Useful for environments where disk space is not accurately reported.
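To make the effect of this flag concrete, here is a minimal sketch of a disk space check that can be bypassed. The function name `has_enough_disk_space` and the `required_bytes` parameter are hypothetical; how Fast-LLM estimates the required space is not described in this reference.

```python
import shutil


def has_enough_disk_space(path: str, required_bytes: int, disable_check: bool = False) -> bool:
    """Return True if `path` has at least `required_bytes` free, or if the check is disabled.

    Hypothetical helper for illustration only.
    """
    if disable_check:
        # Mirrors disable_disk_space_check: trust the caller instead of the reported free space.
        return True
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

Some network or overlay filesystems report free space inaccurately, which is the kind of case where bypassing the check is useful.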

load_from_disk — feature

Type: bool Default: False

Use the load_from_disk method for datasets saved with save_to_disk.

source_schema — optional

Type: LanguageModelSourceConfig Default: (sub-fields optional)

Configuration for the data source.

split — optional

Type: str Default: "train"

Split of the dataset to use.
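The loading-related fields (path, config_name, data_directory, data_files, load_from_disk, split) plausibly translate into a call to the Hugging Face `datasets` library along the following lines. This is a sketch under assumptions, not Fast-LLM's implementation: `build_loader_call` is a hypothetical helper, the config is modeled as a plain dict, and the function only builds the call description rather than importing `datasets`.

```python
def build_loader_call(config: dict) -> tuple[str, dict]:
    """Return the `datasets` function name to call and its keyword arguments.

    Hypothetical mapping from the config fields to loader arguments.
    """
    if config.get("load_from_disk", False):
        # Datasets written with save_to_disk are reopened from a local path.
        # Selecting the split from the loaded object is not shown here.
        return "load_from_disk", {"dataset_path": config["path"]}
    kwargs = {
        "path": config["path"],
        "name": config.get("config_name"),         # config_name -> name
        "data_dir": config.get("data_directory"),  # data_directory -> data_dir
        "data_files": config.get("data_files"),
        "split": config.get("split", "train"),
    }
    return "load_dataset", kwargs
```

For example, `build_loader_call({"path": "my/dataset", "config_name": "en"})` describes a `load_dataset` call with `name="en"` and the default `split="train"`; the dataset name here is a placeholder.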

trust_remote_code — optional

Type: bool Default: False

Allow this dataset to load custom Python code shipped with its repository. Has no effect unless --trust-remote-code is also passed on the command line; both are required so a config file alone cannot enable remote-code execution.
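The two-key requirement can be pictured as a logical AND between the config field and the command-line flag. The sketch below assumes a hypothetical helper `remote_code_allowed` and uses `argparse` to read the flag; the way Fast-LLM actually parses its command line may differ.

```python
import argparse


def remote_code_allowed(config_flag: bool, argv: list[str]) -> bool:
    """Return True only if both the config field and the CLI flag opt in.

    Hypothetical helper: `config_flag` stands for the trust_remote_code field.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--trust-remote-code", action="store_true")
    # Ignore unrelated arguments so the sketch works with any command line.
    args, _ = parser.parse_known_args(argv)
    return config_flag and args.trust_remote_code
```

With this gating, a config file that sets trust_remote_code to true has no effect on its own, and neither does the flag when the field is left at its default of False.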

Used in

dataset in GPTMemmapDatasetPreparatorConfig