AttentionConfig¶
Module: fast_llm.layers.attention.config
Variant of: MixerConfig — select with type: attention
Inherits from: MixerConfig, BlockWithBiasConfig, BlockConfig
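Since this is a variant of MixerConfig selected with `type: attention`, a minimal configuration fragment might look like the following. The exact nesting of the `mixer` section inside the surrounding block config is an assumption here; the values shown are the defaults listed below.

```yaml
# Hypothetical placement of the attention mixer inside a block config;
# adjust the nesting to match your model configuration.
mixer:
  type: attention     # selects AttentionConfig among the MixerConfig variants
  heads: 8            # default
  head_size: 128      # default
  head_groups: 1      # default (multi-query attention)
```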
Fields¶
add_linear_biases—architecture-
Type: bool Default: True
Add biases to linear layers. May be overridden for individual layers.
causal—architecture-
Type: bool Default: True
Use causal attention. Turn this off only for bidirectional attention, e.g., in a Vision Transformer.
head_groups—architecture-
Type: int Default: 1
Number of head groups for grouped query attention. Set to 1 for multi-query attention, or to num_attention_heads for multi-head attention.
head_size—architecture-
Type: int Default: 128
Number of key and value channels, i.e., the hidden dimension of each attention head.
heads—architecture-
Type: int Default: 8
Number of attention heads.
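Together, heads, head_groups, and head_size determine the query and key/value projection widths. The arithmetic below follows the standard grouped-query-attention layout, not Fast-LLM internals, so treat it as a sketch:

```yaml
# Illustrative values, not defaults.
mixer:
  type: attention
  heads: 8          # query heads: query width = 8 * 128 = 1024 channels
  head_groups: 2    # key/value groups: key/value width = 2 * 128 = 256 channels
  head_size: 128    # channels per head
# head_groups: 1 gives multi-query attention (all 8 query heads share one
# key/value head); head_groups: 8 gives standard multi-head attention.
# heads must be divisible by head_groups.
```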
key_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the key layer.
query_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the query layer.
rotary—architecture-
Type: RotaryConfig Default: (sub-fields optional)
Configuration for the rotary positional embeddings.
value_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the value layer.
dense_layer—feature-
Type: AffineLinearConfig Default: (sub-fields optional)
Initialization configuration for the dense layer.
dropout—feature-
Type: float Default: 0.0
Dropout applied to the attention intermediate states.
implementation—feature-
Type: AttentionImplementation Default: "auto"
The implementation to use for the attention layer. Defaults to flash if supported, otherwise backup.
lr_scale—feature-
Type: float or None Default: None
Scaling factor for the layer learning rate. Combines multiplicatively with the scales set by the parent and child layers, if applicable.
use_flash_attention—optional-
Type: bool Default: True
Enable Flash Attention if possible.
window_size—feature-
Type: int or None Default: None
Size of the attention sliding window. Warning: this parameter is not part of the architecture and must be redefined when loading a pretrained model.
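A sliding-window setup might look like the fragment below. The window value is illustrative, and the nesting of the `mixer` section is an assumption:

```yaml
mixer:
  type: attention
  window_size: 4096   # each position attends only to the previous 4096 positions
# Because window_size is not part of the architecture, it must be set again
# explicitly in the config when loading a pretrained model.
```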
softmax_scale_power-
Type: float Default: 0.5
The scaling power applied to head_size in the attention calculation. Under standard parameterization (SP), use the default of 0.5. Under muP, use 1 if scaling head_size, or 0.5 if scaling the number of heads instead of head_size.
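Assuming the usual convention that attention scores are divided by head_size raised to this power (the standard 1/sqrt(d) scaling corresponds to a power of 0.5), the two parameterizations can be sketched as:

```yaml
# Standard parameterization (SP): scores scaled by head_size ** -0.5,
# i.e. 1/sqrt(128) for the default head_size.
mixer:
  type: attention
  head_size: 128
  softmax_scale_power: 0.5

# muP when scaling head_size: scores scaled by head_size ** -1,
# keeping logits stable as head_size grows.
# mixer:
#   type: attention
#   softmax_scale_power: 1
```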