DecoderBlockConfig
Module: fast_llm.layers.decoder.config
Variant of: BlockConfig — select with type: decoder
Inherits from: BlockConfig, ModuleConfig
Fields
mixer (architecture)
Type: MixerConfig Default: (sub-fields optional)
Configuration for the attention/mixer layer.
mlp (architecture)
Type: MLPBaseConfig Default: (sub-fields optional)
Configuration for the feedforward (MLP) layer.
normalization (architecture)
Type: NormalizationConfig Default: (sub-fields optional)
Configuration for the block normalization layers. Used as the default for pre_mixer_normalization and pre_mlp_normalization when they are not set.
output_scale (architecture)
Type: OptionalParameterConfig Default: (sub-fields optional)
Optional learnable scalar multiplied into the block output (after the MLP residual add).
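The fallback from normalization to the two pre-norm fields can be sketched as follows. This is only an illustration of the rule stated in the field descriptions, not Fast-LLM's implementation: the helper name resolve_pre_norm and the plain-dict representation of the config are assumptions made for this example.

```python
from typing import Optional


def resolve_pre_norm(specific: Optional[dict], block_default: dict) -> dict:
    """Hypothetical helper: use the block-level default when the field is unset (None)."""
    return block_default if specific is None else specific


# Plain dicts standing in for the config fields (representation is an assumption).
block = {
    "normalization": {"type": "rms_norm"},
    "pre_mixer_normalization": None,            # not set: falls back to `normalization`
    "pre_mlp_normalization": {"type": "none"},  # explicitly disabled for the MLP only
}

pre_mixer = resolve_pre_norm(block["pre_mixer_normalization"], block["normalization"])
pre_mlp = resolve_pre_norm(block["pre_mlp_normalization"], block["normalization"])

print(pre_mixer)  # {'type': 'rms_norm'}
print(pre_mlp)    # {'type': 'none'}
```

Note that None means "not set" here, whereas {type: none} is an explicit choice that overrides the block default.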
post_mixer_normalization (architecture)
Type: NormalizationConfig or None Default: None
Optional normalization applied to the mixer output before the residual add. Set to {type: rms_norm} to enable.
post_mlp_normalization (architecture)
Type: NormalizationConfig or None Default: None
Optional normalization applied to the MLP output before the residual add. Set to {type: rms_norm} to enable.
pre_mixer_normalization (architecture)
Type: NormalizationConfig or None Default: None
Normalization applied to the residual before the mixer. Defaults to normalization when not set.
pre_mlp_normalization (architecture)
Type: NormalizationConfig or None Default: None
Normalization applied to the residual before the MLP. Defaults to normalization when not set. Set to {type: none} to disable it independently of the pre-mixer norm.
distillation_loss_weight (feature)
Type: float Default: 1.0
Weight used to scale the activation distillation loss.
distillation_model (feature)
Type: str or None Default: None
Name of the reference model to use for activation-level distillation.
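As a rough illustration of how distillation_loss_weight could enter the objective, the sketch below scales a distance between the block's activations and those of the reference model. The field descriptions do not say which distance is used, so the mean-squared error here is purely an assumption, as is the function name.

```python
from typing import Sequence


def activation_distillation_loss(
    student: Sequence[float],
    reference: Sequence[float],
    weight: float = 1.0,  # plays the role of distillation_loss_weight
) -> float:
    """Weighted distance between activations; MSE is an assumed choice of distance."""
    if len(student) != len(reference):
        raise ValueError("student and reference activations must have the same length")
    mse = sum((s - r) ** 2 for s, r in zip(student, reference)) / len(student)
    return weight * mse


# Squared errors are 0 and 4, so the MSE is 2.0 and the weighted loss is 1.0.
loss = activation_distillation_loss([1.0, 2.0], [1.0, 4.0], weight=0.5)
print(loss)  # 1.0
```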
dropout (feature)
Type: float Default: 0.0
Dropout applied to the residual connections.
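The data flow implied by the normalization, dropout, and output_scale fields can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's code: the function names are hypothetical, vectors are plain lists, rms_norm is a parameter-free stand-in for a configured norm, and applying dropout to the sub-layer output just before the residual add is one reading of "applied to the residual connections".

```python
import random
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Parameter-free RMS normalization, standing in for any configured norm."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms for v in x]


def dropout(x: Vector, p: float, training: bool) -> Vector:
    """Inverted dropout: zero each element with probability p, rescale survivors."""
    if not training or p <= 0.0:
        return list(x)
    if p >= 1.0:
        return [0.0 for _ in x]
    return [0.0 if random.random() < p else v / (1.0 - p) for v in x]


def sublayer(residual: Vector, layer: Layer, pre_norm: Optional[Layer],
             post_norm: Optional[Layer], p: float, training: bool) -> Vector:
    """Pre-norm, layer, optional post-norm, dropout (assumed placement), residual add."""
    hidden = layer(pre_norm(residual) if pre_norm is not None else residual)
    if post_norm is not None:
        hidden = post_norm(hidden)
    hidden = dropout(hidden, p, training)
    return [r + h for r, h in zip(residual, hidden)]


def decoder_block(x: Vector, mixer: Layer, mlp: Layer,
                  pre_mixer_norm: Optional[Layer] = None,
                  pre_mlp_norm: Optional[Layer] = None,
                  post_mixer_norm: Optional[Layer] = None,
                  post_mlp_norm: Optional[Layer] = None,
                  p: float = 0.0, output_scale: Optional[float] = None,
                  training: bool = False) -> Vector:
    """Mixer sub-layer, then MLP sub-layer, then the optional output scale."""
    x = sublayer(x, mixer, pre_mixer_norm, post_mixer_norm, p, training)
    x = sublayer(x, mlp, pre_mlp_norm, post_mlp_norm, p, training)
    if output_scale is not None:
        x = [output_scale * v for v in x]  # applied after the MLP residual add
    return x


# Toy layers: the "mixer" doubles its input, the "MLP" returns ones.
out = decoder_block([3.0, 4.0],
                    mixer=lambda v: [2.0 * e for e in v],
                    mlp=lambda v: [1.0 for _ in v],
                    output_scale=0.5)
print(out)  # [5.0, 6.5]
```

With all norms disabled, the toy input becomes [9.0, 12.0] after the mixer residual add and [10.0, 13.0] after the MLP residual add, and the output scale then halves it.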
lr_scale (feature)
Type: float or None Default: None
Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.
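The multiplicative combination of lr_scale values can be illustrated as below. The helper name, the treatment of None as "no scaling", and the base learning rate are assumptions for this sketch. The source only states that the scales of parent and child layers multiply.

```python
import math
from typing import Optional, Sequence


def combine_lr_scales(scales: Sequence[Optional[float]]) -> Optional[float]:
    """Hypothetical helper: multiply the scales that are set, skipping None entries."""
    result: Optional[float] = None
    for scale in scales:
        if scale is None:
            continue  # assumed: an unset scale leaves the learning rate unchanged
        result = scale if result is None else result * scale
    return result


base_lr = 3e-4  # hypothetical optimizer learning rate

# Parent sets 0.5, this block sets nothing, a child layer sets 0.1.
combined = combine_lr_scales([0.5, None, 0.1])
effective_lr = base_lr if combined is None else base_lr * combined
print(math.isclose(combined, 0.05))  # True
```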