DecoderBlockConfig

Module: fast_llm.layers.decoder.config

Variant of: BlockConfig — select with type: decoder

Inherits from: BlockConfig, ModuleConfig

Fields

mixer — architecture

Type: MixerConfig Default: (sub-fields optional)

Configuration for the attention/mixer layer.

mlp — architecture

Type: MLPBaseConfig Default: (sub-fields optional)

Configuration for the feedforward (MLP) layer.

normalization — architecture

Type: NormalizationConfig Default: (sub-fields optional)

Configuration for the block normalization layers. Used as the default for pre_mixer_normalization and pre_mlp_normalization when they are not set.
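As a minimal sketch of this fallback, the fragment below sets only normalization, so both pre_mixer_normalization and pre_mlp_normalization would pick it up. The rms_norm type is the one named elsewhere on this page. Where the block mapping sits inside a full configuration is not covered here, so the enclosing keys are omitted.

```yaml
# Block-level fragment only; enclosing keys are intentionally left out.
type: decoder
normalization:
  type: rms_norm   # shared default for the pre-mixer and pre-MLP norms, since neither is set
```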

output_scale — architecture

Type: OptionalParameterConfig Default: (sub-fields optional)

Optional learnable scalar multiplied into the block output (after the MLP residual add).

post_mixer_normalization — architecture

Type: NormalizationConfig or None Default: None

Optional normalization applied to the mixer output before the residual add. Set to {type: rms_norm} to enable.

post_mlp_normalization — architecture

Type: NormalizationConfig or None Default: None

Optional normalization applied to the MLP output before the residual add. Set to {type: rms_norm} to enable.
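The two post-normalization fields are off by default. A sketch of enabling both, using the {type: rms_norm} value suggested in their descriptions, might look like this (block-level fragment only, enclosing keys omitted):

```yaml
type: decoder
post_mixer_normalization:
  type: rms_norm   # applied to the mixer output before the residual add
post_mlp_normalization:
  type: rms_norm   # applied to the MLP output before the residual add
```

Either field can be set on its own; leaving one out keeps it at None, so no post-normalization is applied on that branch.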

pre_mixer_normalization — architecture

Type: NormalizationConfig or None Default: None

Normalization applied to the residual before the mixer. Falls back to the normalization field when not set.

pre_mlp_normalization — architecture

Type: NormalizationConfig or None Default: None

Normalization applied to the residual before the MLP. Falls back to the normalization field when not set. Set to {type: none} to disable it independently of the pre-mixer norm.
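As an illustration of overriding one pre-norm while the other keeps the shared default, the sketch below disables only the pre-MLP normalization. It uses only field names and values that appear on this page, and the enclosing keys are omitted.

```yaml
type: decoder
normalization:
  type: rms_norm   # still used by pre_mixer_normalization, which is not set
pre_mlp_normalization:
  type: none       # disables the pre-MLP norm only
```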

distillation_loss_weight — feature

Type: float Default: 1.0

Weight used to scale the activation distillation loss.

distillation_model — feature

Type: str or None Default: None

Name of the reference model to use for activation-level distillation.
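The two distillation fields are meant to be used together. In the sketch below, teacher is a hypothetical reference model name chosen for illustration; how reference models are declared and named is outside the scope of this page.

```yaml
type: decoder
distillation_model: teacher      # hypothetical name of a reference model
distillation_loss_weight: 0.5    # scales the activation distillation loss (default 1.0)
```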

dropout — feature

Type: float Default: 0.0

Dropout applied to the residual connections.

lr_scale — feature

Type: float or None Default: None

Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.
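A short worked example of the multiplicative combination, as a block-level fragment. The parent value of 0.1 is an assumption made for the arithmetic and is not set anywhere in this fragment.

```yaml
type: decoder
lr_scale: 0.5   # with a parent scale of 0.1, the effective scale here would be 0.5 * 0.1 = 0.05
```

Leaving lr_scale at None applies no additional scaling at this level.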