SamplingConfigBase

Module: fast_llm.data.dataset.config

Fields

maximum_document_length (core)

Type: int or None    Default: None

Maximum number of tokens in a document. Documents exceeding this size will be truncated or dropped, depending on truncate_documents.
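The interaction between maximum_document_length and truncate_documents can be sketched in plain Python. This is only an illustration of the behavior described above, not Fast-LLM's implementation; the function name `apply_maximum_document_length` and the list-of-token-lists representation are assumptions made for the example.

```python
from typing import List, Optional


def apply_maximum_document_length(
    documents: List[List[int]],
    maximum_document_length: Optional[int],
    truncate_documents: bool,
) -> List[List[int]]:
    """Hypothetical helper: enforce a per-document token limit."""
    if maximum_document_length is None:
        # No limit configured: keep every document unchanged.
        return [list(doc) for doc in documents]
    kept = []
    for doc in documents:
        if len(doc) <= maximum_document_length:
            kept.append(list(doc))
        elif truncate_documents:
            # Over the limit: keep only the leading tokens.
            kept.append(list(doc[:maximum_document_length]))
        # Otherwise the over-long document is dropped.
    return kept


docs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(apply_maximum_document_length(docs, 4, truncate_documents=True))
print(apply_maximum_document_length(docs, 4, truncate_documents=False))
```

With a limit of 4, the second document is shortened to its first four tokens when truncation is enabled and removed entirely when it is disabled.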

micro_batch_size (core)

Type: int    Default: 2048

Size of individual micro-batches.

gpu (feature)

Type: bool    Default: True

Enable fast sampling on GPU. Note that random sampling works differently on GPU, so the resulting samples will not match those produced on CPU.

shuffle (feature)

Type: ShufflingType    Default: "epoch"

Shuffling strategy.

truncate_documents (feature)

Type: bool or None    Default: True

If enabled, documents may be truncated while being packed to fit the sequence length. Otherwise, sequences will be padded so that every document lies entirely within a sample, and documents exceeding the sequence length will be skipped altogether.
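The two packing modes can be illustrated with a small, self-contained sketch. It is a simplified model under stated assumptions, not Fast-LLM's actual packing code: the function name `pack_documents`, the padding value `-1`, and the greedy first-fit strategy are all hypothetical choices made for the example.

```python
from typing import List

PAD_TOKEN = -1  # assumed padding value, for illustration only


def pack_documents(
    documents: List[List[int]],
    sequence_length: int,
    truncate_documents: bool,
) -> List[List[int]]:
    """Hypothetical packer producing fixed-length samples."""
    samples: List[List[int]] = []
    if truncate_documents:
        # Concatenate everything and cut at sample boundaries,
        # so a document may be split (truncated) across samples.
        stream = [token for doc in documents for token in doc]
        for start in range(0, len(stream), sequence_length):
            chunk = stream[start:start + sequence_length]
            chunk += [PAD_TOKEN] * (sequence_length - len(chunk))
            samples.append(chunk)
        return samples

    current: List[int] = []
    for doc in documents:
        if len(doc) > sequence_length:
            # Cannot fit in any sample without truncation: skip it.
            continue
        if len(current) + len(doc) > sequence_length:
            # Pad the current sample and start a new one.
            samples.append(current + [PAD_TOKEN] * (sequence_length - len(current)))
            current = []
        current = current + list(doc)
    if current:
        samples.append(current + [PAD_TOKEN] * (sequence_length - len(current)))
    return samples


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11], [12]]
print(pack_documents(docs, 4, truncate_documents=True))
print(pack_documents(docs, 4, truncate_documents=False))
```

In this sketch, enabling truncation wastes no positions but lets sample boundaries fall inside documents, while disabling it keeps each document whole at the cost of padding and of skipping the six-token document.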