MoEMLPConfig
Module: fast_llm.layers.decoder.mlp.config
Variant of: MLPBaseConfig — select with type: moe
Inherits from: MLPConfig, MLPBaseConfig, BlockWithBiasConfig
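As a sketch of how this variant might be selected in a YAML config: the field names and the type: moe selector below are from this page, but the enclosing mlp key and its placement in the config tree are assumptions.

```yaml
# Hypothetical placement: the enclosing "mlp" key is assumed.
mlp:
  type: moe            # selects MoEMLPConfig among the MLPBaseConfig variants
  experts: 8
  experts_per_token: 2
  intermediate_size: 4096
  gated: true          # activation then defaults to SiLU
```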
Fields
activation — core
Type: ActivationType. Default: None.
The MLP intermediate activation type. Defaults to SiLU for gated MLP, GeLU otherwise.
add_linear_biases — architecture
Type: bool. Default: True.
Add biases to linear layers. May be overridden for individual layers.
experts — architecture
Type: int. Default: 2.
Number of MLP experts in a Mixture-of-Experts (MoE) model.
experts_per_token — architecture
Type: int. Default: 1.
Number of active experts for each token in a MoE model.
gated — architecture
Type: bool. Default: False.
Enable gated MLP.
intermediate_size — architecture
Type: int. Default: 4096.
Hidden dimension of the MLP intermediate state.
layer_1 — architecture
Type: AffineLinearConfig. Default: (sub-fields optional).
Configuration for the first MLP layer.
layer_2 — architecture
Type: AffineLinearConfig. Default: (sub-fields optional).
Configuration for the second MLP layer.
post_norm — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the MLP output.
pre_norm — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the MLP input.
router_input_scale — architecture
Type: float. Default: 1.0.
Constant multiplied into the router input after router_normalization and router_scale. Set to hidden_size ** -0.5 for Gemma-style routing.
router_normalization — architecture
Type: NormalizationConfig or None. Default: None.
Optional normalization applied to the router input (independent of pre_norm, which applies to the expert input).
router_per_expert_scale — architecture
Type: OptionalParameterConfig. Default: (sub-fields optional).
Optional learnable per-expert scale multiplied into the router scores after top-k selection.
router_scale — architecture
Type: OptionalParameterConfig. Default: (sub-fields optional).
Optional learnable per-feature scale applied to the router input after router_normalization.
routing — architecture
Type: RoutingType. Default: "aux_loss".
The routing method, i.e., the method used to assign experts to tokens.
shared_experts — architecture
Type: int. Default: 0.
Number of MLP experts that are shared between all tokens, i.e., always enabled.
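The router fields above can be combined for Gemma-style routing. The sketch below assumes a model hidden size of 2048 (so hidden_size ** -0.5 = 2048 ** -0.5 ≈ 0.0221); the enclosing mlp key and the rms_norm normalization variant are assumptions, while the field names come from this page.

```yaml
# Sketch of Gemma-style routing; enclosing "mlp" key assumed.
mlp:
  type: moe
  routing: aux_loss
  router_input_scale: 0.0221    # hidden_size ** -0.5 for hidden_size = 2048
  router_normalization:
    type: rms_norm              # assumed NormalizationConfig variant
```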
auxiliary_loss_coefficient — feature
Type: float. Default: 0.01.
Scale of the load-balancing auxiliary loss for top-k routing.
jitter_eps — feature
Type: float. Default: 0.0.
Regularize the router during training by applying random multiplicative noise uniform(1 - eps, 1 + eps) to the logits.
lr_scale — feature
Type: float or None. Default: None.
Scaling factor for the layer learning rate. Combines multiplicatively with the scales set by the parent and child layers, if applicable.
router — feature
Type: LinearConfig. Default: (sub-fields optional).
Configuration for the MoE router.
z_loss_coefficient — feature
Type: float. Default: 0.0.
Regularize the router during training by applying a Z-loss to the logits.
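The three router regularizers described above can be enabled together; a sketch with illustrative (non-default) values, again assuming the enclosing mlp key:

```yaml
# Illustrative values only; enclosing "mlp" key assumed.
mlp:
  type: moe
  auxiliary_loss_coefficient: 0.01   # load-balancing loss for top-k routing
  jitter_eps: 0.01                   # noise uniform(0.99, 1.01) on router logits
  z_loss_coefficient: 0.001          # Z-loss on router logits
```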
recompute_level — performance
Type: MLPRecomputeLevel. Default: "none".
Set which of the MLP intermediate activations are recomputed during the backward pass. This provides a trade-off between memory and speed.
dropless_dynamic_shape — expert
Type: bool. Default: False.
Use a dynamic shape for the dropless MLP instead of the worst-case value. Reduces memory usage, but increases fragmentation and requires CPU synchronisation. Not recommended.
implementation — expert
Type: MoEImplementation. Default: "auto".
MoE forward implementation. "auto" selects dropless when Triton is available, looped otherwise.
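A sketch of the performance and expert knobs. The "looped" value appears in the implementation description above; the "activation" value for recompute_level and the enclosing mlp key are assumptions.

```yaml
# Illustrative values; enclosing "mlp" key assumed.
mlp:
  type: moe
  recompute_level: activation   # assumed MLPRecomputeLevel value
  implementation: looped        # force the looped path even when Triton is available
```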