SamplingConfig

Module: fast_llm.data.dataset.config

Inherits from: SamplingConfigBase

Fields

maximum_document_length (core)

Type: int or None    Default: None

Maximum number of tokens in a document. Documents exceeding this size will be truncated or dropped, depending on truncate_documents.

micro_batch_size (core)

Type: int    Default: 2048

Size of individual micro-batches.

gpu (feature)

Type: bool    Default: True

Enable fast sampling on GPU. Note that random sampling works differently on GPU, so GPU-sampled data will not match the CPU equivalent.

shuffle (feature)

Type: ShufflingType    Default: "epoch"

Shuffling strategy.

truncate_documents (feature)

Type: bool or None    Default: True

If enabled, documents may be truncated while being packed to fit the sequence length. Otherwise, sequences are padded so that every document lies entirely within a sample, and documents exceeding the sequence length are skipped altogether.
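To make the two packing modes concrete, here is a minimal sketch of the behavior described above. It is not the actual Fast-LLM implementation; the pad token value 0 and the greedy packing order are assumptions for illustration only.

```python
def pack(documents, sequence_length, truncate_documents=True):
    """Pack token-id documents into fixed-length samples.

    truncate_documents=True: documents are split across sample
    boundaries so every sample is completely filled.
    truncate_documents=False: each sample is padded so no document
    is split, and documents longer than sequence_length are skipped.
    """
    samples, current = [], []
    for doc in documents:
        if truncate_documents:
            current.extend(doc)
            # Emit full samples, carrying the remainder forward.
            while len(current) >= sequence_length:
                samples.append(current[:sequence_length])
                current = current[sequence_length:]
        else:
            if len(doc) > sequence_length:
                continue  # Oversized document: skipped entirely.
            if len(current) + len(doc) > sequence_length:
                # Pad out the current sample (pad token 0 assumed).
                samples.append(current + [0] * (sequence_length - len(current)))
                current = []
            current.extend(doc)
    if current:
        samples.append(current + [0] * (sequence_length - len(current)))
    return samples
```

With truncation every sample is exactly sequence_length tokens; without it, no token sequence crosses a document boundary, at the cost of padding and skipped long documents.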

cache_directory

Type: Path or None    Default: None

dataset_name

Type: str    Default: "dataset"

predicted_tokens

Type: int    Default: 1

rank

Type: int    Default: 0

token_cumsum_rate (performance)

Type: int    Default: 10

Sampling interval for the token cumulative sum index. A smaller value reduces per-sample seek time at the cost of a larger index.
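The trade-off behind this field can be sketched with a strided cumulative-sum index: store the running token total only at every `rate`-th document, then binary-search the coarse index and scan at most `rate` entries to locate a token. This is a hypothetical illustration of the idea, not the Fast-LLM implementation; the function names are invented.

```python
import bisect

def build_index(doc_lengths, rate):
    """Store the running token total at every `rate`-th document.

    A smaller rate stores more entries (larger index) but shortens
    the linear scan when seeking a token (faster per-sample seek).
    """
    index, total = [], 0
    for i, n in enumerate(doc_lengths):
        if i % rate == 0:
            index.append(total)
        total += n
    return index

def locate(doc_lengths, index, rate, token):
    """Find the document containing absolute token position `token`."""
    # Coarse step: binary search over the strided index.
    block = bisect.bisect_right(index, token) - 1
    doc, total = block * rate, index[block]
    # Fine step: scan at most `rate` documents within the block.
    while total + doc_lengths[doc] <= token:
        total += doc_lengths[doc]
        doc += 1
    return doc
```

For example, with document lengths [3, 5, 2, 4, 6] and rate 2, the index is [0, 8, 14], and token 9 falls in the third document.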

world_size

Type: int    Default: 1