
MoEMLPConfig

Module: fast_llm.layers.decoder.mlp.config

Variant of: MLPBaseConfig (select with type: moe)

Inherits from: MLPConfig, MLPBaseConfig, BlockWithBiasConfig

Fields

activation (core)

Type: ActivationType    Default: None

The MLP intermediate activation type. Default: SiLU for gated MLP, GeLU otherwise.

add_linear_biases (architecture)

Type: bool    Default: True

Add biases to linear layers. May be overridden for individual layers.

experts (architecture)

Type: int    Default: 2

Number of MLP experts in a Mixture of Experts (MoE) model.

experts_per_token (architecture)

Type: int    Default: 1

Number of active experts for each token in an MoE model.

gated (architecture)

Type: bool    Default: False

Enable gated MLP.

intermediate_size (architecture)

Type: int    Default: 4096

Hidden dimension of the MLP intermediate state.

layer_1 (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the first MLP layer.

layer_2 (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the second MLP layer.

routing (architecture)

Type: RoutingType    Default: "aux_loss"

The routing method, i.e., the method used to assign experts to tokens.

shared_experts (architecture)

Type: int    Default: 0

Number of MLP experts that are shared between all tokens, i.e., always enabled.

auxiliary_loss_coefficient (feature)

Type: float    Default: 0.01

Scale of the load-balancing auxiliary loss for top-k routing.

jitter_eps (feature)

Type: float    Default: 0.0

Regularize the router during training by applying random multiplicative noise, drawn from uniform(1 - eps, 1 + eps), to the logits.
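As a plain-Python illustration (a sketch, not the library's implementation, which operates on tensors), the jitter described above multiplies each router logit by independent noise from the given interval:

```python
import random

def apply_jitter(logits, eps, rng=random):
    # Multiply each logit by noise drawn uniformly from [1 - eps, 1 + eps].
    # eps = 0.0 leaves the logits unchanged.
    if eps == 0.0:
        return list(logits)
    return [x * rng.uniform(1.0 - eps, 1.0 + eps) for x in logits]
```

Note the noise is multiplicative, so logits near zero are perturbed less than large ones.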

lr_scale (feature)

Type: float or None    Default: None

Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.

router (feature)

Type: LinearConfig    Default: (sub-fields optional)

Configuration for the MoE router.

z_loss_coefficient (feature)

Type: float    Default: 0.0

Regularize the router during training by applying Z-loss to the logits.
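The Z-loss is not spelled out here; one common formulation (the ST-MoE-style router z-loss — an assumption, not necessarily Fast-LLM's exact definition) penalizes the squared log-sum-exp of each token's router logits, averaged over tokens:

```python
import math

def router_z_loss(logits, coefficient):
    # logits: list of per-token rows of router logits.
    # Penalizes large logit magnitudes via the squared log-sum-exp of
    # each row, averaged over tokens and scaled by the coefficient.
    total = 0.0
    for row in logits:
        z = math.log(sum(math.exp(x) for x in row))
        total += z * z
    return coefficient * total / len(logits)
```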

recompute_level (performance)

Type: MLPRecomputeLevel    Default: "none"

Set which of the MLP intermediate activations are recomputed during the backward pass. This trades memory use against speed.

dropless (expert)

Type: bool    Default: True

Evaluate all the experts at once using dropless MoE.

dropless_dynamic_shape (expert)

Type: bool    Default: False

Use a dynamic shape for dropless MLP instead of the worst-case value. Reduces memory usage, but increases fragmentation and requires CPU synchronisation. Not recommended.
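Pulling the fields above together, a hypothetical YAML fragment for an MoE MLP might look like the following. This is a sketch only: the exact nesting under your model configuration may differ, and the values shown are illustrative, not recommendations.

```yaml
mlp:
  type: moe                        # select the MoEMLPConfig variant
  experts: 8                       # total experts
  experts_per_token: 2             # active experts per token
  shared_experts: 1                # always-on experts
  gated: true                      # gated MLP (default activation becomes SiLU)
  intermediate_size: 4096
  routing: aux_loss
  auxiliary_loss_coefficient: 0.01
  z_loss_coefficient: 0.001
  jitter_eps: 0.01
  dropless: true
```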