SamplingConfig

Module: fast_llm.data.dataset.config

Inherits from: SamplingConfigBase

Fields

maximum_document_length (core)

Type: int or None    Default: None

Maximum number of tokens in a document. Documents exceeding this size will be truncated or dropped, depending on truncate_documents.

micro_batch_size (core)

Type: int    Default: 2048

Size of individual micro-batches.

gpu (feature)

Type: bool    Default: True

Enable fast sampling on GPU. Note that random sampling works differently on GPU, so GPU-sampled data will not match the CPU equivalent.

shuffle (feature)

Type: ShufflingType    Default: "epoch"

Shuffling strategy.

truncate_documents (feature)

Type: bool or None    Default: True

If enabled, documents may be truncated while being packed to fit the sequence length. Otherwise, sequences are padded so that every document lies entirely within a sample, and documents exceeding the sequence length are skipped altogether.
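To make the two packing modes concrete, here is a minimal sketch of the behavior described above. It is not the actual Fast-LLM implementation; the pad token value 0 and the greedy packing order are assumptions for illustration only.

```python
def pack(documents, sequence_length, truncate_documents=True):
    """Pack token-id documents into fixed-length samples.

    truncate_documents=True: documents are split across sample
    boundaries so every sample is completely filled.
    truncate_documents=False: each sample is padded so no document
    is split, and documents longer than sequence_length are skipped.
    """
    samples, current = [], []
    for doc in documents:
        if truncate_documents:
            current.extend(doc)
            # Emit full samples, carrying the remainder forward.
            while len(current) >= sequence_length:
                samples.append(current[:sequence_length])
                current = current[sequence_length:]
        else:
            if len(doc) > sequence_length:
                continue  # Oversized document: skipped entirely.
            if len(current) + len(doc) > sequence_length:
                # Pad out the current sample (pad token 0 assumed).
                samples.append(current + [0] * (sequence_length - len(current)))
                current = []
            current.extend(doc)
    if current:
        samples.append(current + [0] * (sequence_length - len(current)))
    return samples
```

With truncation every sample is exactly sequence_length tokens; without it, no token sequence crosses a document boundary, at the cost of padding and skipped long documents.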

cache_directory

Type: Path or None    Default: None

dataset_name

Type: str    Default: "dataset"

predicted_tokens

Type: int    Default: 1

rank

Type: int    Default: 0

token_cumsum_rate (performance)

Type: int    Default: 10

Sampling interval for the token cumulative sum index. A smaller value reduces per-sample seek time at the cost of a larger index.
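The trade-off behind this field can be sketched with a strided cumulative-sum index: store the running token total only at every `rate`-th document, then binary-search the coarse index and scan at most `rate` entries to locate a token. This is a hypothetical illustration of the idea, not the Fast-LLM implementation; the function names are invented.

```python
import bisect

def build_index(doc_lengths, rate):
    """Store the running token total at every `rate`-th document.

    A smaller rate stores more entries (larger index) but shortens
    the linear scan when seeking a token (faster per-sample seek).
    """
    index, total = [], 0
    for i, n in enumerate(doc_lengths):
        if i % rate == 0:
            index.append(total)
        total += n
    return index

def locate(doc_lengths, index, rate, token):
    """Find the document containing absolute token position `token`."""
    # Coarse step: binary search over the strided index.
    block = bisect.bisect_right(index, token) - 1
    doc, total = block * rate, index[block]
    # Fine step: scan at most `rate` documents within the block.
    while total + doc_lengths[doc] <= token:
        total += doc_lengths[doc]
        doc += 1
    return doc
```

For example, with document lengths [3, 5, 2, 4, 6] and rate 2, the index is [0, 8, 14], and token 9 falls in the third document.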

world_size

Type: int    Default: 1