MoEMLPConfig

Module: fast_llm.layers.decoder.mlp.config

Variant of: MLPBaseConfig — select with type: moe

Inherits from: MLPConfig, MLPBaseConfig, BlockWithBiasConfig
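A minimal sketch of selecting this variant may help before the field list. The surrounding mlp key and its placement are placeholders; only type: moe and the field names come from this page:

```yaml
mlp:
  type: moe             # selects MoEMLPConfig among the MLPBaseConfig variants
  experts: 8            # illustrative value; default is 2
  experts_per_token: 2  # illustrative value; default is 1
```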

Fields

activation (core)

Type: ActivationType    Default: None

The MLP intermediate activation type. Default: SiLU for gated MLP, GeLU otherwise.

add_linear_biases (architecture)

Type: bool    Default: True

Add biases to linear layers. May be overridden for individual layers.

experts (architecture)

Type: int    Default: 2

Number of MLP experts in a Mixture of Experts (MoE) model.

experts_per_token (architecture)

Type: int    Default: 1

Number of active experts for each token in an MoE model.

gated (architecture)

Type: bool    Default: False

Enable gated MLP.

intermediate_size (architecture)

Type: int    Default: 4096

Hidden dimension of the MLP intermediate state.

layer_1 (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the first MLP layer.

layer_2 (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the second MLP layer.

post_norm (architecture)

Type: NormalizationConfig or None    Default: None

Optional normalization applied to the MLP output.

pre_norm (architecture)

Type: NormalizationConfig or None    Default: None

Optional normalization applied to the MLP input.

router_input_scale (architecture)

Type: float    Default: 1.0

Constant multiplied into the router input after router_normalization and router_scale. Set to hidden_size ** -0.5 for Gemma-style routing.
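For instance, Gemma-style routing with an assumed hidden_size of 2048 would set this field to 2048 ** -0.5 (a sketch; the parent key and hidden size are placeholders):

```yaml
mlp:
  type: moe
  router_input_scale: 0.022097087  # 2048 ** -0.5, i.e. hidden_size ** -0.5
```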

router_normalization (architecture)

Type: NormalizationConfig or None    Default: None

Optional normalization applied to the router input (independent of pre_norm, which is applied to the expert input).

router_per_expert_scale (architecture)

Type: OptionalParameterConfig    Default: (sub-fields optional)

Optional learnable per-expert scale multiplied into the router scores after top-k selection.

router_scale (architecture)

Type: OptionalParameterConfig    Default: (sub-fields optional)

Optional learnable per-feature scale applied to the router input after router_normalization.

routing (architecture)

Type: RoutingType    Default: "aux_loss"

The routing method, i.e., the method used to assign experts to tokens.

shared_experts (architecture)

Type: int    Default: 0

Number of MLP experts that are shared between all tokens, i.e., always enabled.
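The three expert-count fields above can be sketched together; the values are illustrative, not recommendations, and the parent key is a placeholder:

```yaml
mlp:
  type: moe
  experts: 8            # MLP experts in the MoE layer
  experts_per_token: 2  # experts activated per token
  shared_experts: 1     # experts shared across all tokens (always enabled)
```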

auxiliary_loss_coefficient (feature)

Type: float    Default: 0.01

Scale of the load-balancing auxiliary loss for top-k routing.

jitter_eps (feature)

Type: float    Default: 0.0

Regularize the router during training by applying random multiplicative noise, uniform(1 - eps, 1 + eps), to the logits.

lr_scale (feature)

Type: float or None    Default: None

Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.

router (feature)

Type: LinearConfig    Default: (sub-fields optional)

Configuration for the MoE router.

z_loss_coefficient (feature)

Type: float    Default: 0.0

Regularize the router during training by applying Z-loss to the logits.
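The router-regularization fields (routing, auxiliary_loss_coefficient, jitter_eps, z_loss_coefficient) compose as sketched below; the non-default values are illustrative, not recommendations:

```yaml
mlp:
  type: moe
  routing: aux_loss                 # default routing method
  auxiliary_loss_coefficient: 0.01  # load-balancing loss scale (default)
  jitter_eps: 0.01                  # multiplicative uniform(1-eps, 1+eps) noise on the logits
  z_loss_coefficient: 0.001         # z-loss on the router logits (0.0 disables it)
```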

recompute_level (performance)

Type: MLPRecomputeLevel    Default: "none"

Select which of the MLP intermediate activations are recomputed during the backward pass, trading memory for speed.

dropless_dynamic_shape (expert)

Type: bool    Default: False

Use a dynamic shape for the dropless MLP instead of the worst-case value. Reduces memory usage, but increases fragmentation and requires CPU synchronization. Not recommended.

implementation (expert)

Type: MoEImplementation    Default: "auto"

MoE forward implementation. auto selects dropless when Triton is available, looped otherwise.
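A sketch forcing a specific forward implementation instead of auto; the dropless and looped values are inferred from the description above, and the parent key is a placeholder:

```yaml
mlp:
  type: moe
  implementation: dropless       # requires Triton; looped is the non-Triton fallback
  dropless_dynamic_shape: false  # keep worst-case static shapes (default; dynamic shapes not recommended)
```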

Used in