GPTMemmapDatasetPreparatorConfig
Module: fast_llm.data.preparation.gpt_memmap.config
Variant of: RunnableConfig — select with type: prepare_gpt_memmap
Variant of: DatasetPreparatorConfig — select with type: gpt_memmap
Inherits from: DatasetPreparatorConfig, RunnableConfig
Fields
output_path (core)
Type: Path Default: None
Output directory for the processed dataset.
dataset (feature)
Type: GPTHuggingfaceDatasetConfig Default: (sub-fields optional)
Configuration for the dataset.
distributed (feature)
Type: DatasetPreparatorDistributedConfig Default: (sub-fields optional)
Configuration for distributed processing.
documents_per_shard (feature)
Type: int Default: 1_000_000
Target number of documents per shard.
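As a rough illustration of what a per-shard document target implies, the sketch below estimates shard sizes for a given corpus size. It assumes shards are filled up to the target and the last shard takes the remainder. That packing rule and the helper name `estimate_shard_sizes` are hypothetical and not taken from Fast-LLM; since the value is described as a target, the real shard sizes may differ.

```python
def estimate_shard_sizes(num_documents: int, documents_per_shard: int = 1_000_000) -> list[int]:
    """Estimate per-shard document counts (hypothetical helper, not a Fast-LLM API).

    Assumption: every shard is filled to `documents_per_shard`, and a final,
    smaller shard holds whatever is left over.
    """
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    if documents_per_shard <= 0:
        raise ValueError("documents_per_shard must be positive")

    # Full shards first, then one partial shard if documents remain.
    full_shards, remainder = divmod(num_documents, documents_per_shard)
    sizes = [documents_per_shard] * full_shards
    if remainder:
        sizes.append(remainder)
    return sizes
```

Under these assumptions, lowering `documents_per_shard` produces more, smaller shards for the same corpus.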
image_patches (feature)
Type: ImagePreparationConfig Default: (sub-fields optional)
Configuration for the image patches, if enabled.
num_workers (optional)
Type: int Default: 1
Number of parallel workers.
splits (optional)
Type: dict[str, float] or None Default: None
Split the output dataset into multiple datasets (e.g., train/valid/test) with the specified ratios. Samples are not shuffled.
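Because samples are not shuffled, each split presumably covers a contiguous slice of the output, in the order the split names are listed. The sketch below shows one way such ratios could map to index ranges. It is an assumption for illustration only: the helper `split_ranges` is hypothetical, the ratios are normalized by their sum, and the actual preparator may compute boundaries differently (for example by token count or at shard boundaries).

```python
def split_ranges(num_samples: int, splits: dict[str, float]) -> dict[str, range]:
    """Map split ratios to contiguous, non-overlapping index ranges.

    Hypothetical helper, not a Fast-LLM API. Ratios are normalized by their
    sum, so {"train": 8, "valid": 1, "test": 1} behaves like 0.8/0.1/0.1.
    """
    if any(ratio < 0 for ratio in splits.values()):
        raise ValueError("split ratios must be non-negative")
    total = sum(splits.values())
    if total <= 0:
        raise ValueError("split ratios must sum to a positive value")

    ranges: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for name, ratio in splits.items():
        # Use the cumulative ratio so rounding errors do not accumulate
        # and the final split always ends exactly at num_samples.
        cumulative += ratio
        end = round(num_samples * cumulative / total)
        ranges[name] = range(start, end)
        start = end
    return ranges
```

With a layout like this, the source data should already be in a suitable order (or shuffled beforehand) if the splits are meant to be statistically similar.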
tokenizer (feature)
Type: TokenizerConfig Default: (sub-fields optional)
Configuration for the tokenizer.