Instruction Finetuning

In this guide, we provide step-by-step instructions to do a supervised finetuning (SFT) of the Llama 3.1 8B model on an instruction tuning dataset Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered. Typically, SFT differs from pretraining in 2 key ways:

Loss Computation: Instruction tuning datasets typically contain system, user and assistant messages formatted using a chat template e.g., ChatML. The loss is typically computed only on the tokens from the assistant messages, and hence the gradient updates do not include the system or user tokens.
Packing: Fast-LLM packs multiple documents into a sequence to maximize training throughput. However, paying attention to other documents in a sequence can detrimental to instruction-finetuned models. Moreover, packing to sequence length can mean some document are split across multiple sequences.

Fast-LLM provides options to modify both of these, as we will see in this guide. Please follow the Quick Start guide for the initial setup.

📚 Step 1: Download the Pretrained Model¶

git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./sft-tutorial/Llama-3.1-8B

🔄 Step 2: Format the dataset into a chat template¶

We'll use Llama-3.1-8B-Instruct's tokenizer for this tutorial. Let's create a folder first:

mkdir -p ./sft-tutorial/checkpoints/Llama-3.1-8B-Instruct

And then download the tokenizer with this Python script:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./sft-tutorial/Llama-3.1-8B-Instruct")

To disable loss computation on pieces of texts like the user/system messages or the chat template tags, we need to define the character spans for this masking. The example below formats the Magpie dataset using Llama-3.1-8B-Instruct's chat template and defines spans where the loss should be masked.

from datasets import load_dataset
from transformers import AutoTokenizer

def apply_chat_template(conversation):
    chatml_conv = []
    for conv in conversation:
        if conv["from"] == "human":
            chatml_conv.append({"role": "user", "content": conv["value"]})
        elif conv["from"] == "gpt":
            chatml_conv.append({"role": "assistant", "content": conv["value"]})
    return tokenizer.apply_chat_template(conversation=chatml_conv, tokenize=False)


def get_spans(text: str, start_delimiter: str, end_delimiter: str):
    spans = []
    start = 0
    while start < len(text):
        end = text.find(start_delimiter, start) + len(start_delimiter)
        if end == -1:
            break
        spans.append([start, end - 1])  # character span indices are inclusive
        start = text.find(end_delimiter, end) + len(end_delimiter)
    return spans


def get_sample_with_spans(sample, start_delimiter, end_delimiter):
    text = apply_chat_template(sample["conversations"])
    spans = get_spans(text, start_delimiter, end_delimiter)
    return {"text": text, "spans": spans}

dataset_name = "Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered"
tokenizer_path = "./sft-tutorial/Llama-3.1-8B-Instruct"
dataset = load_dataset(dataset_name)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

start_delimiter = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
end_delimiter = "<|eot_id|>"
dataset = dataset.map(lambda x: get_sample_with_spans(x, start_delimiter, end_delimiter), num_proc=16)
dataset.save_to_disk("./sft-tutorial/chatml_dataset")

🔍 Step 3: Prepare the dataset¶

This step is similar to Data preparation, with one additional option for the field with loss masking spans.

output_path: ./sft-tutorial/tokenized/Llama-3.1-8B

loading_workers: 32
tokenize_workers: 32
saving_workers: 32

dataset:
  path: ./sft-tutorial/hf_dataset
  split: "train"
  trust_remote_code: true
  field: text
  loss_masking_spans: spans

tokenizer:
  path: ./sft-tutorial/Llama-3.1-8B
splits:
  training: 0.998
  validation: 0.002

Some HuggingFace datasets ship custom Python loaders that run on download. The trust_remote_code: true field above only takes effect when you also pass --trust-remote-code on the fast-llm prepare command line; the field alone is intentionally a no-op so that opening someone else's config can never trigger remote-code execution. Both opt-ins are required (--trust-remote-code is the master switch and trust_remote_code: true enables it for that specific call). If you would rather opt in once for every call in the run, pass --trust-all-remote-code instead.

⚙️ Step 4: Configure Fast-LLM¶

It's time to configure the Fast-LLM training config. This is very similar to Quick Start with one additional option, namely truncate_documents, which is important for improving the task performance of instruction-tuned models.

training:
  train_iters: 5_000
  logs:
    interval: 1
  evaluators:
    validation:
      interval: 100
      evaluator:
        type: loss
        iterations: 25
        dataset_name: validation
  checkpoint:
    interval: 1000
    keep: 5
  export:
    format: llama
    interval: 1000
data:
  micro_batch_size: 4096
  maximum_document_length: 4096
  datasets:
    training:
      type: file
      path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_training.yaml
    validation:
      type: file
      path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_validation.yaml
  truncate_documents: no # (1)!
  sampling:
    use_loss_masking_spans: yes
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 2.0e-05
    minimum: 0.0
    decay_style: cosine
    decay_iterations: 4_900
    warmup_iterations: 100
pretrained:
  format: llama
  path: ./sft-tutorial/Llama-3.1-8B
  model_weights: yes
model:
  base_model:
    decoder:
      block:
        mixer:
          use_flash_attention: yes
  multi_stage:
    zero_stage: 3
  distributed:
    timeout: 3600
    compute_dtype: bf16
run:
  experiment_dir: ./sft-tutorial/llama-3.1-8b-instruct-magpie

Avoids truncating documents to fit into a packed sequence and starts a new sequence instead. Documents longer than sequence length will be skipped altogether.

Launching the training run is similar to Step 7 in the Quick Start guide.