AttentionConfig

Module: fast_llm.layers.attention.config

Variant of: MixerConfig — select with type: attention

Inherits from: MixerConfig, BlockWithBiasConfig, BlockConfig
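A minimal config sketch of how this variant might be selected. The field names come from this page, but the surrounding `mixer:` nesting and YAML layout are assumptions and may differ from the actual Fast-LLM config format:

```yaml
mixer:
  type: attention      # selects the AttentionConfig variant of MixerConfig
  heads: 8
  head_groups: 2       # grouped-query attention: 4 query heads per KV group
  head_size: 128
  causal: true
```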

Fields

add_linear_biases (architecture)

Type: bool    Default: True

Add biases to linear layers. May be overridden for individual layers.

causal (architecture)

Type: bool    Default: True

Use causal attention. Turn this off only for bidirectional attention, e.g., in Vision Transformers.

head_groups (architecture)

Type: int    Default: 1

Number of head groups for grouped-query attention. Set to 1 for multi-query attention, or equal to heads for standard multi-head attention.
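The relationship between heads and head_groups can be made concrete with a small sketch (the helper name is hypothetical; only the grouping semantics come from this page):

```python
def queries_per_group(heads: int, head_groups: int) -> int:
    """Number of query heads served by each key/value head group."""
    # heads must divide evenly into head_groups for the grouping to be valid.
    assert heads % head_groups == 0
    return heads // head_groups

print(queries_per_group(8, 1))  # multi-query: all 8 query heads share one KV head
print(queries_per_group(8, 8))  # multi-head: one KV head per query head
print(queries_per_group(8, 2))  # grouped-query: 4 query heads per KV group
```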

head_size (architecture)

Type: int    Default: 128

Number of key and value channels, i.e., hidden dimension of each attention head.

heads (architecture)

Type: int    Default: 8

Number of attention heads.

key_layer (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the key layer.

query_layer (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the query layer.

rotary (architecture)

Type: RotaryConfig    Default: (sub-fields optional)

Configuration for the rotary positional embeddings.

value_layer (architecture)

Type: AffineLinearConfig    Default: (sub-fields optional)

Configuration for the value layer.

dense_layer (feature)

Type: AffineLinearConfig    Default: (sub-fields optional)

Initialization configuration for the dense layer.

dropout (feature)

Type: float    Default: 0.0

Dropout applied to the attention intermediate states.

implementation (feature)

Type: AttentionImplementation    Default: "auto"

The implementation to use for the attention layer. Default: flash if supported, otherwise backup.

lr_scale (feature)

Type: float or None    Default: None

Scaling factor for the layer learning rate. Combines multiplicatively with the scale set by the parent and child layers, if applicable.

use_flash_attention (optional)

Type: bool    Default: True

Enable Flash Attention if possible.

window_size (feature)

Type: int or None    Default: None

Size of the attention sliding window. Warning: this parameter is not part of the architecture and must be redefined when loading a pretrained model.

softmax_scale_power

Type: float    Default: 0.5

The scaling power applied to head_size in the attention calculation. Under standard parameterization (SP), use the default of 0.5. Under muP, use 1 if scaling head_size, or 0.5 if scaling the number of heads instead of head_size.
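As a sketch of the parameter's effect (the helper name is hypothetical; this assumes the attention logits are multiplied by head_size raised to the negative power, which is how such a scale is conventionally applied):

```python
def softmax_scale(head_size: int, power: float = 0.5) -> float:
    # Attention logits are scaled by head_size ** -power before the softmax.
    return head_size ** -power

# The SP default power=0.5 recovers the familiar 1/sqrt(head_size) scaling;
# under muP with scaled head_size, power=1 gives 1/head_size instead.
print(softmax_scale(128))       # 1/sqrt(128)
print(softmax_scale(128, 1.0))  # 1/128
```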