SamplingConfig¶
Module: fast_llm.data.dataset.config
Inherits from: SamplingConfigBase
Fields¶
maximum_document_length — core
Type: int or None
Default: None
Maximum number of tokens in a document. Documents exceeding this size will be truncated or dropped, depending on truncate_documents.

micro_batch_size — core
Type: int
Default: 2048
Size of individual micro-batches.
gpu — feature
Type: bool
Default: True
Enable fast sampling on GPU. Note that random sampling works differently on GPU, so the resulting sample won't match the CPU equivalent.
shuffle — feature
Type: ShufflingType
Default: "epoch"
Shuffling strategy.
truncate_documents — feature
Type: bool or None
Default: True
If enabled, documents may be truncated while being packed to fit the sequence length. Otherwise, sequences will be padded so that every document lies entirely within a sample, and documents exceeding the sequence length will be skipped altogether.
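The two packing modes behind truncate_documents can be sketched in plain Python. This is an illustrative sketch only: the function and its behavior at sample boundaries are simplified assumptions, not Fast-LLM's internal implementation.

```python
def pack(documents, sequence_length, truncate_documents=True, pad_token=0):
    """Pack lists of token ids into fixed-length samples.

    truncate_documents=True: documents flow across sample boundaries,
    so a document may be cut mid-way.
    truncate_documents=False: each document must fit entirely inside one
    sample; samples are padded and over-long documents are skipped.
    """
    samples, current = [], []
    for doc in documents:
        if truncate_documents:
            current.extend(doc)
            while len(current) >= sequence_length:
                samples.append(current[:sequence_length])
                current = current[sequence_length:]
        else:
            if len(doc) > sequence_length:
                continue  # cannot fit in any sample: skip the document
            if len(current) + len(doc) > sequence_length:
                # pad out the current sample and start a new one
                samples.append(current + [pad_token] * (sequence_length - len(current)))
                current = []
            current.extend(doc)
    if current:
        samples.append(current + [pad_token] * (sequence_length - len(current)))
    return samples
```

With truncation enabled every sample is fully dense; with it disabled no document is ever split, at the cost of padding tokens.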
cache_directory
Type: Path or None
Default: None

dataset_name
Type: str
Default: "dataset"

predicted_tokens
Type: int
Default: 1

rank
Type: int
Default: 0

token_cumsum_rate — performance
Type: int
Default: 10
Sampling interval for the token cumulative sum index. A smaller value reduces per-sample seek time at the cost of a larger index.
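The trade-off behind token_cumsum_rate can be illustrated with a small sketch: store the cumulative token count only every `rate` documents, then seek with a binary search plus a short linear scan. The function names and index layout here are hypothetical, chosen only to show the idea, not Fast-LLM's actual data structures.

```python
import bisect

def build_cumsum_index(doc_lengths, rate):
    """Record the cumulative token count every `rate` documents.

    A smaller rate yields a larger index but a shorter linear scan
    when seeking to a token offset.
    """
    index, total = [], 0
    for i, length in enumerate(doc_lengths):
        if i % rate == 0:
            index.append(total)
        total += length
    return index, total

def find_document(token_offset, doc_lengths, index, rate):
    """Return (document index, offset within that document) for a token."""
    # Binary search the sparse index, then scan at most `rate` documents.
    block = bisect.bisect_right(index, token_offset) - 1
    doc, cumsum = block * rate, index[block]
    while cumsum + doc_lengths[doc] <= token_offset:
        cumsum += doc_lengths[doc]
        doc += 1
    return doc, token_offset - cumsum
```

Halving the rate roughly doubles the index size while halving the worst-case scan length, which is the trade-off the field description refers to.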
world_size
Type: int
Default: 1
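Putting the user-facing fields together, a sampling section of a YAML config might look like the fragment below. The nesting under `sampling:` and the chosen values are assumptions for illustration; only the key names come from the fields documented above.

```yaml
# Hypothetical placement within a larger training config.
sampling:
  micro_batch_size: 2048
  maximum_document_length: 8192   # example value: truncate or drop longer documents
  truncate_documents: true
  shuffle: epoch
  gpu: true
  token_cumsum_rate: 10
```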