GPTMemmapDatasetPreparatorConfig

Module: fast_llm.data.preparation.gpt_memmap.config

Variant of: RunnableConfig — select with type: prepare_gpt_memmap

Variant of: DatasetPreparatorConfig — select with type: gpt_memmap

Inherits from: DatasetPreparatorConfig, RunnableConfig

Fields

output_path — core

Type: Path Default: None

Output directory for the processed dataset.

add_bos — optional

Type: bool Default: True

Prepend the tokenizer's BOS token to each document.

add_eos — optional

Type: bool Default: True

Append the tokenizer's EOS token to each document.
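
The effect of `add_bos` and `add_eos` on a single tokenized document can be sketched as follows. This is an illustration only, not the library's implementation: `wrap_document` and the token IDs are hypothetical, and in practice the BOS and EOS IDs come from the configured tokenizer.

```python
def wrap_document(token_ids, bos_id, eos_id, add_bos=True, add_eos=True):
    """Return a copy of token_ids with optional BOS/EOS markers added.

    Hypothetical helper: bos_id and eos_id stand in for whatever IDs
    the configured tokenizer defines.
    """
    wrapped = list(token_ids)
    if add_bos:
        wrapped.insert(0, bos_id)  # prepend BOS
    if add_eos:
        wrapped.append(eos_id)  # append EOS
    return wrapped


# Made-up IDs: 1 for BOS, 2 for EOS.
both = wrap_document([5, 6, 7], bos_id=1, eos_id=2)
eos_only = wrap_document([5, 6, 7], bos_id=1, eos_id=2, add_bos=False)
```

With both flags at their defaults, every document carries explicit boundary markers, so document boundaries stay recoverable after documents are concatenated.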

dataset — feature

Type: GPTHuggingfaceDatasetConfig Default: (sub-fields optional)

Configuration for the dataset.

distributed — feature

Type: DatasetPreparatorDistributedConfig Default: (sub-fields optional)

Configuration for distributed processing.

documents_per_shard — feature

Type: int Default: 1_000_000

Target number of documents per shard.
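
As a rough guide to how `documents_per_shard` relates to the number of output shards, the sketch below assumes documents are grouped in order into shards of at most that size. This is an assumption: the value is described only as a target, so actual shard sizes may differ. `shard_sizes` is a hypothetical helper.

```python
def shard_sizes(num_documents, documents_per_shard=1_000_000):
    """Estimate shard sizes under a naive in-order grouping (an assumption)."""
    if documents_per_shard <= 0:
        raise ValueError("documents_per_shard must be positive")
    full_shards, remainder = divmod(num_documents, documents_per_shard)
    sizes = [documents_per_shard] * full_shards
    if remainder:
        sizes.append(remainder)  # last, partially filled shard
    return sizes


sizes = shard_sizes(2_500_000)
```

Under this assumption, 2,500,000 documents with the default setting give two full shards and one half-size shard.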

image_patches — feature

Type: ImagePreparationConfig Default: (sub-fields optional)

Configuration for the image patches, if enabled.

num_workers — optional

Type: int Default: 1

Number of parallel workers.

splits — optional

Type: dict[str, float] or None Default: None

Split the output dataset into several datasets (e.g., train/valid/test) with the specified ratios. Samples are not shuffled before splitting.
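
Because samples are not shuffled, each split receives a contiguous run of the input in its original order. The sketch below illustrates that behavior under stated assumptions: `split_in_order` is hypothetical, it normalizes the ratios by their sum, and it rounds the boundaries. The library's exact rounding and validation rules are not specified here.

```python
def split_in_order(documents, splits):
    """Partition documents into contiguous, unshuffled splits by ratio.

    Hypothetical helper: ratios are normalized by their sum and the
    boundaries are rounded to the nearest index.
    """
    total = sum(splits.values())
    if total <= 0:
        raise ValueError("split ratios must sum to a positive value")
    n = len(documents)
    result = {}
    start = 0
    cumulative = 0.0
    for name, ratio in splits.items():
        cumulative += ratio
        end = round(n * cumulative / total)
        result[name] = documents[start:end]
        start = end
    return result


parts = split_in_order(list(range(10)), {"train": 0.8, "valid": 0.1, "test": 0.1})
```

Here `train` holds the first eight items, `valid` the ninth, and `test` the tenth. If the source data is ordered (for example by date or topic), the splits inherit that ordering, so shuffle the source beforehand if a random split is needed.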

tokenizer — feature

Type: TokenizerConfig Default: (sub-fields optional)

Configuration for the tokenizer.
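
For orientation, the scalar fields above can be collected into a plain mapping such as the one below. It only restates the documented field names and defaults. The output path and split ratios are made-up values. The nested `dataset`, `distributed`, `image_patches`, and `tokenizer` sections are left out because their sub-fields are not listed on this page, and the `type` selector is left out because its value depends on which base config is being selected.

```python
from pathlib import Path

# Illustrative values only; keys mirror the field names documented above.
preparator_config = {
    "output_path": Path("/tmp/prepared_dataset"),  # hypothetical location
    "add_bos": True,  # documented default
    "add_eos": True,  # documented default
    "documents_per_shard": 1_000_000,  # documented default
    "num_workers": 1,  # documented default
    "splits": {"train": 0.9, "valid": 0.1},  # made-up ratios; default is None
}
```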