MoEMLPConfig¶
Module: fast_llm.layers.decoder.mlp.config
Variant of: MLPBaseConfig — select with type: moe
Inherits from: MLPConfig, MLPBaseConfig, BlockWithBiasConfig
Fields¶
activation [core]
Type: ActivationType Default: None
The MLP intermediate activation type. Default: SiLU for gated MLP, GeLU otherwise.
add_linear_biases [architecture]
Type: bool Default: True
Add biases to linear layers. May be overridden for individual layers.
experts [architecture]
Type: int Default: 2
Number of MLP experts in a Mixture of Experts (MoE) model.
experts_per_token [architecture]
Type: int Default: 1
Active experts for each token in a MoE model.
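Taken together, `experts` and `experts_per_token` describe top-k routing: a router scores every expert, and each token is processed only by its k best-scoring experts, weighted by a softmax over the selected scores. A minimal pure-Python sketch (illustrative only, not the Fast-LLM implementation; `top_k_experts` is a hypothetical name):

```python
import math

def top_k_experts(logits, k):
    """Pick the k highest-scoring experts for one token; return their
    indices and softmax weights renormalized over the selected experts."""
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in indices]
    total = sum(exps)
    return indices, [e / total for e in exps]

# experts=4, experts_per_token=2: the two strongest experts share the token.
idx, weights = top_k_experts([0.1, 2.0, -1.0, 2.0], k=2)  # idx == [1, 3]
```

Shared experts (see `shared_experts` below) bypass this selection entirely and run for every token.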
gated [architecture]
Type: bool Default: False
Enable gated MLP.
intermediate_size [architecture]
Type: int Default: 4096
Hidden dimension of the MLP intermediate state.
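With `gated: true`, the MLP computes down(act(gate(x)) * up(x)) rather than the plain down(act(layer_1(x))); in gated MLPs the first layer typically produces both the gate and the up projection. A scalar toy sketch under that assumption, using the SiLU default for gated MLPs (`silu` and `gated_mlp_scalar` are illustrative names, not Fast-LLM APIs):

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gated_mlp_scalar(x, w_gate, w_up, w_down):
    """Toy scalar version of a gated MLP: down(silu(gate(x)) * up(x)).
    In the real layer these are matrix multiplies over intermediate_size
    dimensions, not scalar products."""
    return w_down * (silu(w_gate * x) * (w_up * x))
```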
layer_1 [architecture]
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the first MLP layer.
layer_2 [architecture]
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the second MLP layer.
routing [architecture]
Type: RoutingType Default: "aux_loss"
The routing method, i.e., the method used to assign experts to tokens.
shared_experts [architecture]
Type: int Default: 0
Number of MLP experts that are shared between all tokens, i.e., always enabled.
auxiliary_loss_coefficient [feature]
Type: float Default: 0.01
Scale of the load-balancing auxiliary loss for top-k routing.
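With `routing: aux_loss`, this coefficient scales a load-balancing term that pushes the router toward spreading tokens evenly across experts. A sketch of the standard Switch-Transformer-style formulation (Fast-LLM's exact loss may differ; `load_balancing_loss` is a hypothetical name):

```python
def load_balancing_loss(fractions, mean_probs, coefficient=0.01):
    """Switch-style auxiliary loss: coefficient * n * sum_i(f_i * P_i),
    where f_i is the fraction of tokens routed to expert i and P_i is
    the mean router probability assigned to expert i. Minimized (at
    coefficient for n experts... actually at its floor) when both
    distributions are uniform."""
    n = len(fractions)
    return coefficient * n * sum(f * p for f, p in zip(fractions, mean_probs))
```

With perfectly uniform routing over n experts, f_i = P_i = 1/n and the loss reduces to the coefficient itself.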
jitter_eps [feature]
Type: float Default: 0.0
Regularize the router during training by applying random multiplicative noise uniform(1 - eps, 1 + eps) to the logits.
lr_scale [feature]
Type: float or None Default: None
Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.
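The jitter_eps noise described above can be sketched as follows (illustrative only; `jitter_logits` is a hypothetical name):

```python
import random

def jitter_logits(logits, eps, rng=None):
    """Scale each router logit by a factor drawn from uniform(1 - eps, 1 + eps).
    With eps == 0.0 (the default) the logits pass through unchanged."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    return [x * rng.uniform(1.0 - eps, 1.0 + eps) for x in logits]
```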
router [feature]
Type: LinearConfig Default: (sub-fields optional)
Configuration for the MoE router.
z_loss_coefficient [feature]
Type: float Default: 0.0
Regularize the router during training by applying Z-loss to the logits.
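Z-loss on router logits is conventionally coefficient * logsumexp(logits)², which discourages the logits from drifting to large magnitudes. A per-token sketch under that assumption (`router_z_loss` is a hypothetical name):

```python
import math

def router_z_loss(logits, coefficient):
    """Z-loss for one token's router logits: coefficient * logsumexp(logits)**2.
    Zero when the logits' softmax normalizer is exactly 1."""
    lse = math.log(sum(math.exp(x) for x in logits))
    return coefficient * lse * lse
```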
recompute_level [performance]
Type: MLPRecomputeLevel Default: "none"
Set which of the MLP intermediate activations are recomputed during the backward pass. This provides a trade-off between memory and speed.
dropless [expert]
Type: bool Default: True
Evaluate all the experts at once using dropless MoE.
dropless_dynamic_shape [expert]
Type: bool Default: False
Use a dynamic shape for dropless MLP instead of the worst-case value. Reduces memory usage, but increases fragmentation and requires CPU synchronisation. Not recommended.