GPTMemmapDatasetPreparatorConfig

Module: fast_llm.data.preparation.gpt_memmap.config

Variant of: RunnableConfig — select with type: prepare_gpt_memmap

Variant of: DatasetPreparatorConfig — select with type: gpt_memmap

Inherits from: DatasetPreparatorConfig, RunnableConfig

Fields

output_path (core)

Type: Path    Default: None

Output directory for the processed dataset.

dataset (feature)

Type: GPTHuggingfaceDatasetConfig    Default: (sub-fields optional)

Configuration for the dataset.

distributed (feature)

Type: DatasetPreparatorDistributedConfig    Default: (sub-fields optional)

Configuration for distributed processing.

documents_per_shard (feature)

Type: int    Default: 1_000_000

Target number of documents per shard.
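
Because the value is a target rather than a hard limit, actual shard sizes may differ from it. The sketch below shows one way a document count could be divided into near-equal shards that stay at or below the target. The helper `shard_sizes` is hypothetical and is not part of Fast-LLM; the preparator's real sharding logic may differ.

```python
import math


def shard_sizes(num_documents: int, documents_per_shard: int = 1_000_000) -> list[int]:
    """Hypothetical helper: split a document count into near-equal shards.

    Assumption: the shard count is the smallest one that keeps every shard
    at or below the target size. This is an illustration only.
    """
    if documents_per_shard <= 0:
        raise ValueError("documents_per_shard must be positive")
    if num_documents <= 0:
        return []
    num_shards = math.ceil(num_documents / documents_per_shard)
    # Spread the remainder over the first shards so sizes differ by at most 1.
    base, extra = divmod(num_documents, num_shards)
    return [base + 1 if i < extra else base for i in range(num_shards)]


sizes = shard_sizes(2_500_000)
print(len(sizes), sizes)
```

With the default target, 2.5 million documents would land in three shards under this scheme; lowering `documents_per_shard` increases the shard count.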

image_patches (feature)

Type: ImagePreparationConfig    Default: (sub-fields optional)

Configuration for the image patches, if enabled.

num_workers (optional)

Type: int    Default: 1

Number of parallel workers.

splits (optional)

Type: dict[str, float] or None    Default: None

Split the output dataset into multiple datasets (e.g., train/valid/test) with the specified ratios. Samples are not shuffled.
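
Since samples are not shuffled, a ratio-based split presumably assigns each split a contiguous range of the output in its original order. The sketch below illustrates that idea at the document level. The helper `split_boundaries` is hypothetical, and both the normalization of ratios by their sum and the rounding rule are assumptions; Fast-LLM may compute boundaries differently.

```python
def split_boundaries(
    num_documents: int, splits: dict[str, float]
) -> dict[str, tuple[int, int]]:
    """Hypothetical helper: map split names to contiguous [start, end) ranges.

    Assumptions: ratios are normalized by their sum, and splits follow the
    dictionary's insertion order. Nothing is shuffled.
    """
    total = sum(splits.values())
    if total <= 0:
        raise ValueError("split ratios must sum to a positive value")
    boundaries: dict[str, tuple[int, int]] = {}
    cumulative = 0.0
    start = 0
    for name, ratio in splits.items():
        cumulative += ratio
        # Round the cumulative fraction so the ranges tile the dataset exactly.
        end = round(num_documents * cumulative / total)
        boundaries[name] = (start, end)
        start = end
    return boundaries


ranges = split_boundaries(1000, {"train": 0.8, "valid": 0.1, "test": 0.1})
print(ranges)
```

Because the order is preserved, a source dataset that is sorted (for example by date or by origin) yields splits with different distributions; shuffle the source beforehand if that matters.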

tokenizer (feature)

Type: TokenizerConfig    Default: (sub-fields optional)

Configuration for the tokenizer.
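
As a rough illustration of how the scalar fields above fit together, the sketch below collects a few overrides in a plain Python dictionary and checks the keys against the field names listed on this page. The values and the output path are made up. The nested `dataset`, `tokenizer`, `distributed`, and `image_patches` sections are left out because their sub-fields are not documented here, and the `type` selector is omitted as well. How such a dictionary is passed to Fast-LLM is not shown and is not assumed.

```python
# Field names taken from this page; everything else is illustrative.
DOCUMENTED_FIELDS = {
    "output_path",
    "dataset",
    "distributed",
    "documents_per_shard",
    "image_patches",
    "num_workers",
    "splits",
    "tokenizer",
}

# Hypothetical overrides; unspecified fields keep their documented defaults.
overrides = {
    "output_path": "/tmp/prepared_dataset",  # hypothetical output directory
    "documents_per_shard": 500_000,          # default is 1_000_000
    "num_workers": 4,                        # default is 1
    "splits": {"train": 0.9, "valid": 0.1},  # default is None (no split)
}

# Catch typos in field names before handing the values to any tool.
unknown = set(overrides) - DOCUMENTED_FIELDS
if unknown:
    raise KeyError(f"Unknown fields: {sorted(unknown)}")
print(sorted(overrides))
```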