MultiModalBaseModelConfig¶
Module: fast_llm.models.multimodal.config
Inherits from: VisionMultiModalModelConfig, GPTBaseModelConfig, LanguageModelConfig
Fields¶
decoder (architecture)
Type: BlockSequenceConfig Default: (sub-fields optional)
Configuration for the language model decoder.
embeddings (architecture)
Type: LanguageModelEmbeddingsConfig Default: (sub-fields optional)
Configuration for the language model embeddings.
head (architecture)
Type: LanguageModelHeadConfig Default: (sub-fields optional)
Configuration for the language model head(s).
hidden_size (architecture)
Type: int Default: 1024
Size of the model's main hidden dimension, e.g., for its input and output layers.
peft (architecture)
Type: PeftConfig Default: (sub-fields optional)
Configuration for parameter-efficient fine-tuning.
tied_embedding_weight (architecture)
Type: bool Default: False
Tie the output weights (logits) with the vocabulary embedding.
vision_encoder (architecture)
Type: VisionEncoderConfig Default: (sub-fields optional)
Configuration for the vision encoder.
image_token_index (optional)
Type: int or None Default: None
Index of the image token. Unused, but required for Hugging Face conversion.
lr_scale (feature)
Type: float or None Default: None
Scaling factor for the layer learning rate. Combines multiplicatively with the scales set by the parent and child layers, if applicable.
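The multiplicative combination described for `lr_scale` can be sketched as follows. This is a minimal illustration of the stated semantics, not the Fast-LLM implementation: the helper name is hypothetical, and the convention that `None` means "no scaling" is an assumption drawn from the default.

```python
def combine_lr_scales(*scales):
    """Hypothetical sketch: multiply all non-None lr scales together.

    A scale of None (the field's default) is treated as "no scaling";
    if every level leaves the scale unset, the result stays None.
    """
    result = None
    for scale in scales:
        if scale is not None:
            result = scale if result is None else result * scale
    return result


# A parent scale of 0.5 combined with a layer scale of 0.1 yields an
# effective learning-rate scale of 0.05; the unset middle level is ignored.
effective = combine_lr_scales(0.5, None, 0.1)
```

Under this reading, setting `lr_scale` on a nested config never overrides the parent's scale; it only scales it further.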