ServiceNow/Fast-LLM

MultiStageConfig

Module: fast_llm.engine.multi_stage.config

Inherits from: StageConfig

Fields

full_precision_gradients — optional

Type: bool Default: True

Reduce and accumulate gradients in fp32 to improve numerical stability.
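
To illustrate why accumulating in fp32 can help, the sketch below adds the same small value many times while rounding the running total to half or single precision after every addition. It emulates the rounding with Python's `struct` module and is not Fast-LLM code; the helper names `round_to` and `accumulate` are hypothetical.

```python
import struct


def round_to(value: float, fmt: str) -> float:
    """Round a Python float to a struct format ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(grad: float, steps: int, fmt: str) -> float:
    """Add `grad` to a running total `steps` times, rounding after each add."""
    grad = round_to(grad, fmt)
    total = 0.0
    for _ in range(steps):
        total = round_to(total + grad, fmt)
    return total


# The exact sum is 10; compare how each precision tracks it.
fp16_total = accumulate(0.001, 10_000, "e")
fp32_total = accumulate(0.001, 10_000, "f")
print(fp16_total, fp32_total)
```

The half-precision total stops growing once the increment is smaller than half the spacing between representable values, while the single-precision total stays close to the exact sum.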

store_frozen_weights_in_optimization_precision — optional

Type: bool Default: True

Store frozen weights in full precision even if not needed. This preserves precision in saved checkpoints, at the cost of memory and compute (copy) overhead.

layers_per_stage — performance

Type: float Default: 1.0

Number of layers to include in each Fast-LLM stage.
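
As a rough illustration of how `layers_per_stage` could relate to the number of stages, the sketch below groups layer indices into consecutive stages. The rounding rule for fractional or non-dividing values is an assumption made for this example, and the helper `split_layers` is hypothetical; the actual splitting logic in Fast-LLM may differ.

```python
import math


def split_layers(num_layers: int, layers_per_stage: float) -> list:
    """Group layer indices into stages (assumed rule: truncate boundaries)."""
    if layers_per_stage < 1:
        raise ValueError("this sketch assumes layers_per_stage >= 1")
    num_stages = math.ceil(num_layers / layers_per_stage)
    stages = []
    for i in range(num_stages):
        start = int(i * layers_per_stage)
        end = min(int((i + 1) * layers_per_stage), num_layers)
        stages.append(list(range(start, end)))
    return stages


one_each = split_layers(4, 1.0)    # default: one layer per stage
pairs = split_layers(5, 2.0)       # last stage holds the remainder
fractional = split_layers(6, 1.5)  # stages alternate between 1 and 2 layers
print(one_each, pairs, fractional)
```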

zero_stage — performance

Type: int or None Default: None

The ZeRO (Zero Redundancy Optimizer) stage.

debug_activation_memory — logging

Type: bool Default: False

Log memory usage after each layer.

debug_all_param_gradients — logging

Type: int Default: 0

Log each parameter gradient after reduction.

debug_global_tensors — logging

Type: bool Default: True

Reconstruct global tensors for debug logs (slow and memory-intensive; does not concatenate sequential micro-batches).

debug_layer_gradients — logging

Type: int Default: 0

Log the (input) gradients of each layer.

debug_layer_outputs — logging

Type: int Default: 0

Log the output of each layer.

debug_param_gradients — logging

Type: int Default: 0

Log the gradient shard after reduction.

debug_param_init — logging

Type: int Default: 0

Log the parameters after initialization.

debug_param_update — logging

Type: int Default: 0

Log the parameters after update.

debug_tensor_parallel — logging

Type: bool Default: False

Check for tensor-parallel desyncs and log an error if one is found. High overhead.

num_grad_buffers — expert

Type: int or None Default: None

Number of stage buffers for gradients. Normally set through the ZeRO stage.

num_weight_buffers — expert

Type: int or None Default: None

Number of stage buffers for weights. Normally set through the ZeRO stage.
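
The two buffer fields are described as normally being set through `zero_stage`. A minimal sketch of what such a derivation might look like follows. The valid stage range, the specific buffer counts, and the fallback of one buffer are all assumptions for illustration, and `resolve_buffer_counts` is a hypothetical helper, not a Fast-LLM function.

```python
from typing import Optional, Tuple


def resolve_buffer_counts(
    zero_stage: Optional[int],
    num_weight_buffers: Optional[int] = None,
    num_grad_buffers: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (weight buffers, gradient buffers) under the assumed mapping."""
    if zero_stage is not None:
        if zero_stage not in (1, 2, 3):
            raise ValueError("this sketch assumes ZeRO stages 1 to 3")
        # Assumption: stages that shard gradients use two gradient buffers.
        if zero_stage >= 2:
            num_grad_buffers = 2
        # Assumption: the stage that shards weights uses two weight buffers.
        if zero_stage >= 3:
            num_weight_buffers = 2
    # Assumption: fall back to a single buffer when nothing is specified.
    weights = 1 if num_weight_buffers is None else num_weight_buffers
    grads = 1 if num_grad_buffers is None else num_grad_buffers
    return weights, grads


for stage in (None, 1, 2, 3):
    print(stage, resolve_buffer_counts(stage))
```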

pipeline_delay — expert

Type: float Default: 0.0

Estimated delay (in steps) for data to go around the pipeline, used to improve pipeline-parallel network overlap. Currently unused.

stages_per_pipeline_stage — wip

Type: int Default: 1

Number of Fast-LLM stages on each pipeline stage.
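
For quick reference, the defaults listed above can be collected into a plain Python dictionary. This is only a sketch: the field names and defaults come from this page, but the dictionary and the `with_overrides` helper are hypothetical and are not part of the Fast-LLM configuration API.

```python
# Field names and defaults as documented on this page.
MULTI_STAGE_DEFAULTS = {
    "full_precision_gradients": True,
    "store_frozen_weights_in_optimization_precision": True,
    "layers_per_stage": 1.0,
    "zero_stage": None,
    "debug_activation_memory": False,
    "debug_all_param_gradients": 0,
    "debug_global_tensors": True,
    "debug_layer_gradients": 0,
    "debug_layer_outputs": 0,
    "debug_param_gradients": 0,
    "debug_param_init": 0,
    "debug_param_update": 0,
    "debug_tensor_parallel": False,
    "num_grad_buffers": None,
    "num_weight_buffers": None,
    "pipeline_delay": 0.0,
    "stages_per_pipeline_stage": 1,
}


def with_overrides(**overrides) -> dict:
    """Return a copy of the defaults with overrides, rejecting unknown names."""
    unknown = set(overrides) - set(MULTI_STAGE_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown field(s): {sorted(unknown)}")
    return {**MULTI_STAGE_DEFAULTS, **overrides}


config = with_overrides(layers_per_stage=2.0, debug_tensor_parallel=True)
print(config["layers_per_stage"], config["debug_tensor_parallel"])
```

Rejecting unknown names catches misspelled field names early instead of silently ignoring them.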

Used in