MoEMLPConfig
Module: fast_llm.layers.decoder.mlp.config
Variant of: MLPBaseConfig — select with type: moe
Inherits from: MLPConfig, MLPBaseConfig, BlockWithBiasConfig
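As a sketch of how this variant might be selected in a YAML config: the field names and the type: moe selector below are from this page, but the enclosing mlp key and its placement in the config tree are assumptions.

```yaml
# Hypothetical placement: the enclosing "mlp" key is assumed.
mlp:
  type: moe            # selects MoEMLPConfig among the MLPBaseConfig variants
  experts: 8
  experts_per_token: 2
  intermediate_size: 4096
  gated: true          # activation then defaults to SiLU
```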
Fields
activation — core
Type: ActivationType. Default: None.
The MLP intermediate activation type. Defaults to SiLU for gated MLP, GeLU otherwise.
add_linear_biases — architecture
Type: bool. Default: True.
Add biases to linear layers. May be overridden for individual layers.
experts — architecture
Type: int. Default: 2.
Number of MLP experts in a Mixture-of-Experts (MoE) model.
experts_per_token — architecture
Type: int. Default: 1.
Number of active experts for each token in a MoE model.
gated — architecture
Type: bool. Default: False.
Enable gated MLP.
intermediate_size — architecture
Type: int. Default: 4096.
Hidden dimension of the MLP intermediate state.
layer_1 — architecture
Type: AffineLinearConfig. Default: (sub-fields optional).
Configuration for the first MLP layer.
layer_2 — architecture
Type: AffineLinearConfig. Default: (sub-fields optional).
Configuration for the second MLP layer.
post_norm — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the MLP output.
pre_norm — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the MLP input.
router_input_scale — architecture
Type: float. Default: 1.0.
Constant multiplied into the router input after router_normalization and router_scale. Set to hidden_size ** -0.5 for Gemma-style routing.
router_normalization — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the router input (independent of pre_norm, which applies to the expert input).
router_per_expert_scale — architecture
Type: OptionalParameterConfig. Default: (sub-fields optional).
Optional learnable per-expert scale multiplied into the router scores after top-k selection.
router_scale — architecture
Type: OptionalParameterConfig. Default: (sub-fields optional).
Optional learnable per-feature scale applied to the router input after router_normalization.
routing — architecture
Type: RoutingType. Default: "aux_loss".
The routing method, i.e., the method used to assign experts to tokens.
shared_experts — architecture
Type: int. Default: 0.
Number of MLP experts that are shared between all tokens, i.e., always enabled.
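The router fields above can be combined for Gemma-style routing. The sketch below assumes a model hidden size of 2048 (so hidden_size ** -0.5 = 2048 ** -0.5 ≈ 0.0221); the enclosing mlp key and the rms_norm normalization variant are assumptions, while the field names come from this page.

```yaml
# Sketch of Gemma-style routing; enclosing "mlp" key assumed.
mlp:
  type: moe
  routing: aux_loss
  router_input_scale: 0.0221    # hidden_size ** -0.5 for hidden_size = 2048
  router_normalization:
    type: rms_norm              # assumed NormalizationConfig variant
```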
auxiliary_loss_coefficient — feature
Type: float. Default: 0.01.
Scale of the load-balancing auxiliary loss for top-k routing.
jitter_eps — feature
Type: float. Default: 0.0.
Regularize the router during training by applying random multiplicative noise uniform(1 - eps, 1 + eps) to the logits.
lr_scale — feature
Type: float or None. Default: None.
Scaling factor for the layer learning rate. Combines multiplicatively with the scales set by the parent and child layers, if applicable.
router — feature
Type: LinearConfig. Default: (sub-fields optional).
Configuration for the MoE router.
z_loss_coefficient — feature
Type: float. Default: 0.0.
Regularize the router during training by applying a Z-loss to the logits.
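The three router regularizers described above can be enabled together; a sketch with illustrative (non-default) values, again assuming the enclosing mlp key:

```yaml
# Illustrative values only; enclosing "mlp" key assumed.
mlp:
  type: moe
  auxiliary_loss_coefficient: 0.01   # load-balancing loss for top-k routing
  jitter_eps: 0.01                   # noise uniform(0.99, 1.01) on router logits
  z_loss_coefficient: 0.001          # Z-loss on router logits
```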
recompute_level — performance
Type: MLPRecomputeLevel. Default: "none".
Set which of the MLP intermediate activations are recomputed during the backward pass. This provides a trade-off between memory and speed.
dropless_dynamic_shape — expert
Type: bool. Default: False.
Use a dynamic shape for the dropless MLP instead of the worst-case value. Reduces memory usage, but increases fragmentation and requires CPU synchronisation. Not recommended.
implementation — expert
Type: MoEImplementation. Default: "auto".
MoE forward implementation. "auto" selects dropless when Triton is available, looped otherwise.
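A sketch of the performance and expert knobs. The "looped" value appears in the implementation description above; the "activation" value for recompute_level and the enclosing mlp key are assumptions.

```yaml
# Illustrative values; enclosing "mlp" key assumed.
mlp:
  type: moe
  recompute_level: activation   # assumed MLPRecomputeLevel value
  implementation: looped        # force the looped path even when Triton is available
```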