SamplingConfigBase

Module: fast_llm.data.dataset.config

Fields

maximum_document_length — core

Type: int or None Default: None

Maximum number of tokens in a document. Documents exceeding this size will be truncated or dropped, depending on truncate_documents.
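
The interaction between maximum_document_length and truncate_documents can be illustrated with a minimal sketch. The helper `limit_document_lengths` is hypothetical and written only for this illustration; it is not part of Fast-LLM and does not claim to match the library's internal implementation. It assumes documents are plain lists of token ids.

```python
from typing import List, Optional


def limit_document_lengths(
    documents: List[List[int]],
    maximum_document_length: Optional[int],
    truncate_documents: bool,
) -> List[List[int]]:
    """Hypothetical helper: apply a per-document token limit.

    Illustrates the documented behavior only; not Fast-LLM code.
    """
    # No limit configured: every document is kept unchanged.
    if maximum_document_length is None:
        return [list(doc) for doc in documents]

    kept = []
    for doc in documents:
        if len(doc) <= maximum_document_length:
            # Document fits within the limit.
            kept.append(list(doc))
        elif truncate_documents:
            # Too long, truncation enabled: keep the leading tokens.
            kept.append(list(doc[:maximum_document_length]))
        # Too long, truncation disabled: the document is dropped.
    return kept
```

With a limit of 4 tokens, the documents `[1, 2, 3]` and `[4, 5, 6, 7, 8]` become `[1, 2, 3]` and `[4, 5, 6, 7]` when truncation is enabled; when it is disabled, only `[1, 2, 3]` is kept.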

micro_batch_size — core

Type: int Default: 2048

Size of individual micro-batches.

gpu — feature

Type: bool Default: True

Enable fast sampling on GPU. Random sampling works differently on GPU, so the resulting samples will not match those produced on CPU.

shuffle — feature

Type: ShufflingType Default: "epoch"

Shuffling strategy.

truncate_documents — feature

Type: bool or None Default: True

If enabled, documents may be truncated while being packed to fit the sequence length. Otherwise, sequences will be padded so that every document lies entirely within a sample, and documents exceeding the sequence length will be skipped altogether.
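
The two packing modes described for truncate_documents can be sketched as follows. This is an illustrative approximation under stated assumptions, not Fast-LLM's actual sampling code: `pack_documents`, the `pad_token` value of -1, and the greedy in-order packing strategy are all hypothetical choices made for the example.

```python
from typing import List


def pack_documents(
    documents: List[List[int]],
    sequence_length: int,
    truncate_documents: bool = True,
    pad_token: int = -1,  # hypothetical padding id, chosen for the example
) -> List[List[int]]:
    """Hypothetical sketch: pack token lists into fixed-length samples."""
    samples = []
    current: List[int] = []

    def flush() -> None:
        # Pad the sample in progress to the full length and start a new one.
        nonlocal current
        samples.append(current + [pad_token] * (sequence_length - len(current)))
        current = []

    for doc in documents:
        if truncate_documents:
            # Documents may be cut at sample boundaries to fill every sample.
            tokens = list(doc)
            while tokens:
                space = sequence_length - len(current)
                current.extend(tokens[:space])
                tokens = tokens[space:]
                if len(current) == sequence_length:
                    flush()
        else:
            # Documents longer than a sample are skipped altogether.
            if len(doc) > sequence_length:
                continue
            # Never split a document: pad and close the sample if it won't fit.
            if len(current) + len(doc) > sequence_length:
                flush()
            current.extend(doc)

    if current:
        flush()
    return samples
```

For the documents `[1, 2, 3]`, `[4, 5, 6]` and `[7, 8, 9, 10, 11, 12]` with a sequence length of 4, the truncating mode yields three full samples, `[1, 2, 3, 4]`, `[5, 6, 7, 8]` and `[9, 10, 11, 12]`. The non-truncating mode yields `[1, 2, 3, -1]` and `[4, 5, 6, -1]`, skipping the six-token document. The trade-off is that truncation wastes no positions on padding, while the padded mode keeps each document intact within a single sample.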