DecoderBlockConfig

Module: fast_llm.layers.decoder.config

Variant of: BlockConfig — select with type: decoder

Inherits from: BlockConfig, ModuleConfig

Fields

mixer (architecture)

Type: MixerConfig    Default: (sub-fields optional)

Configuration for the attention/mixer layer.

mlp (architecture)

Type: MLPBaseConfig    Default: (sub-fields optional)

Configuration for the feedforward (MLP) layer.

normalization (architecture)

Type: NormalizationConfig    Default: (sub-fields optional)

Configuration for the block normalization layers. Used as default for pre_mixer_normalization and pre_mlp_normalization when not set.

output_scale (architecture)

Type: OptionalParameterConfig    Default: (sub-fields optional)

Optional learnable scalar multiplied into the block output (after the MLP residual add).

post_mixer_normalization (architecture)

Type: NormalizationConfig or None    Default: None

Optional normalization applied to the mixer output before the residual add. Set to {type: rms_norm} to enable.

post_mlp_normalization (architecture)

Type: NormalizationConfig or None    Default: None

Optional normalization applied to the MLP output before the residual add. Set to {type: rms_norm} to enable.

pre_mixer_normalization (architecture)

Type: NormalizationConfig or None    Default: None

Normalization applied to the residual before the mixer. Falls back to the normalization field when not set.

pre_mlp_normalization (architecture)

Type: NormalizationConfig or None    Default: None

Normalization applied to the residual before the MLP. Falls back to the normalization field when not set. Set to {type: none} to disable it independently of the pre-mixer norm.
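
As an illustration of how the normalization fields interact, here is a minimal sketch of a decoder block entry. It uses only the type values mentioned above (rms_norm and none). Where this fragment sits inside a full model config is not covered on this page, so treat the surrounding structure as an assumption.

```yaml
# Sketch of a decoder block; only the fields documented above are used.
type: decoder

# Block-wide default: used for pre_mixer_normalization and
# pre_mlp_normalization whenever those are left unset.
normalization:
  type: rms_norm

# pre_mixer_normalization is omitted, so it falls back to `normalization`.

# Disable the pre-MLP norm without affecting the pre-mixer norm.
pre_mlp_normalization:
  type: none

# Optional norms on the mixer and MLP outputs, applied before each residual add.
post_mixer_normalization:
  type: rms_norm
post_mlp_normalization:
  type: rms_norm
```

With this fragment, the pre-mixer norm would be RMS norm (inherited), the pre-MLP norm would be disabled, and both post-norms would be enabled. Leaving the two post-norm fields out keeps them at their default of None, that is, not applied.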

distillation_loss_weight (feature)

Type: float    Default: 1.0

Weight used to scale the activation distillation loss.

distillation_model (feature)

Type: str or None    Default: None

Name of the reference model to use for activation-level distillation.
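
The two distillation fields are typically set together. The fragment below is a sketch only: the name teacher is a hypothetical placeholder, and it is assumed that a reference model with that name is declared elsewhere in the configuration, which this page does not describe.

```yaml
type: decoder

# Hypothetical name; assumed to match a reference model defined elsewhere.
distillation_model: teacher

# Scales the activation distillation loss (default 1.0); 0.5 halves it.
distillation_loss_weight: 0.5
```

Leaving distillation_model at its default of None presumably means no activation-level distillation is applied to this block, in which case the weight has no effect.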

dropout (feature)

Type: float    Default: 0.0

Dropout applied to the residual connections.

lr_scale (feature)

Type: float or None    Default: None

Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.
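
To make the multiplicative rule concrete, the sketch below sets a block-level scale alongside residual dropout. The values are illustrative, not recommendations. Assuming a parent layer also sets a scale of 0.5 (a hypothetical value), the effective factor for this block would be 0.5 × 0.5 = 0.25, further multiplied by any scale set on a child layer such as the mixer or MLP.

```yaml
type: decoder

# Dropout applied to the residual connections (default 0.0).
dropout: 0.1

# Learning-rate scale for this block; multiplies with parent and child scales.
lr_scale: 0.5
```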