AttentionConfig¶
Module: fast_llm.layers.attention.config
Variant of: MixerConfig — select with type: attention
Inherits from: MixerConfig, BlockWithBiasConfig, BlockConfig
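Since this is a variant of MixerConfig selected with `type: attention`, a minimal configuration fragment might look like the following. The exact nesting of the `mixer` section inside the surrounding block config is an assumption here; the values shown are the defaults listed below.

```yaml
# Hypothetical placement of the attention mixer inside a block config;
# adjust the nesting to match your model configuration.
mixer:
  type: attention     # selects AttentionConfig among the MixerConfig variants
  heads: 8            # default
  head_size: 128      # default
  head_groups: 1      # default (multi-query attention)
```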
Fields¶
add_linear_biases—architecture-
Type: bool Default: True
Add biases to linear layers. May be overridden for individual layers.
causal—architecture-
Type: bool Default: True
Use causal attention. Turn this off only for bidirectional attention, e.g., in a Vision Transformer.
head_groups—architecture-
Type: int Default: 1
Number of head groups for grouped query attention. Set to 1 for multi-query attention, or to num_attention_heads for multi-head attention.
head_size—architecture-
Type: int Default: 128
Number of key and value channels, i.e., the hidden dimension of each attention head.
heads—architecture-
Type: int Default: 8
Number of attention heads.
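Together, heads, head_groups, and head_size determine the query and key/value projection widths. The arithmetic below follows the standard grouped-query-attention layout, not Fast-LLM internals, so treat it as a sketch:

```yaml
# Illustrative values, not defaults.
mixer:
  type: attention
  heads: 8          # query heads: query width = 8 * 128 = 1024 channels
  head_groups: 2    # key/value groups: key/value width = 2 * 128 = 256 channels
  head_size: 128    # channels per head
# head_groups: 1 gives multi-query attention (all 8 query heads share one
# key/value head); head_groups: 8 gives standard multi-head attention.
# heads must be divisible by head_groups.
```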
key_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the key layer.
query_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the query layer.
rotary—architecture-
Type: RotaryConfig Default: (sub-fields optional)
Configuration for the rotary positional embeddings.
value_layer—architecture-
Type: AffineLinearConfig Default: (sub-fields optional)
Configuration for the value layer.
dense_layer—feature-
Type: AffineLinearConfig Default: (sub-fields optional)
Initialization configuration for the dense layer.
dropout—feature-
Type: float Default: 0.0
Dropout applied to the attention intermediate states.
implementation—feature-
Type: AttentionImplementation Default: "auto"
The implementation to use for the attention layer. Defaults to flash if supported, otherwise backup.
lr_scale—feature-
Type: float or None Default: None
Scaling factor for the layer learning rate. Combines multiplicatively with the scales set by the parent and child layers, if applicable.
use_flash_attention—optional-
Type: bool Default: True
Enable Flash Attention if possible.
window_size—feature-
Type: int or None Default: None
Size of the attention sliding window. Warning: this parameter is not part of the architecture and must be redefined when loading a pretrained model.
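A sliding-window setup might look like the fragment below. The window value is illustrative, and the nesting of the `mixer` section is an assumption:

```yaml
mixer:
  type: attention
  window_size: 4096   # each position attends only to the previous 4096 positions
# Because window_size is not part of the architecture, it must be set again
# explicitly in the config when loading a pretrained model.
```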
softmax_scale_power-
Type: float Default: 0.5
The scaling power applied to head_size in the attention calculation. Under standard parameterization (SP), use the default of 0.5. Under muP, use 1 if scaling head_size, or 0.5 if scaling the number of heads instead of head_size.
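Assuming the usual convention that attention scores are divided by head_size raised to this power (the standard 1/sqrt(d) scaling corresponds to a power of 0.5), the two parameterizations can be sketched as:

```yaml
# Standard parameterization (SP): scores scaled by head_size ** -0.5,
# i.e. 1/sqrt(128) for the default head_size.
mixer:
  type: attention
  head_size: 128
  softmax_scale_power: 0.5

# muP when scaling head_size: scores scaled by head_size ** -1,
# keeping logits stable as head_size grows.
# mixer:
#   type: attention
#   softmax_scale_power: 1
```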