Instruction Finetuning
In this guide, we provide step-by-step instructions for supervised finetuning (SFT) of the Llama 3.1 8B model on the Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered instruction-tuning dataset. SFT typically differs from pretraining in two key ways:
- Loss Computation: Instruction-tuning datasets typically contain system, user, and assistant messages formatted with a chat template, e.g., ChatML. The loss is typically computed only on the tokens of the assistant messages, so gradient updates exclude the system and user tokens (see the short illustration at the end of this introduction).
- Packing: Fast-LLM packs multiple documents into a sequence to maximize training throughput. However, attending to other documents in a sequence can be detrimental to instruction-finetuned models. Moreover, packing to the sequence length can mean some documents are split across multiple sequences.
Fast-LLM provides options to modify both of these, as we will see in this guide. Please follow the Quick Start guide for the initial setup.
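To make the loss masking concrete, here is a minimal, illustrative sketch (using generic ChatML tags rather than the exact Llama 3.1 format produced in Step 2) of which characters contribute to the loss:

# Illustrative only: in a chat-formatted sample, only the assistant's reply
# should contribute to the loss; everything else is masked.
text = (
    "<|im_start|>user\nWhat is 2 + 2?<|im_end|>\n"
    "<|im_start|>assistant\n2 + 2 = 4.<|im_end|>\n"
)
header = "<|im_start|>assistant\n"
reply_start = text.index(header) + len(header)
reply_end = text.index("<|im_end|>", reply_start)
print(text[reply_start:reply_end])  # "2 + 2 = 4." -> the loss is computed only here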
📚 Step 1: Download the Pretrained Model
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./sft-tutorial/Llama-3.1-8B
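If you prefer not to use Git LFS, the same snapshot can be fetched with the huggingface_hub library instead (an optional alternative; note that the meta-llama repositories are gated, so you need to have accepted the license and be logged in):

from huggingface_hub import snapshot_download

# Download the base model into the same directory used by the git clone above.
snapshot_download(repo_id="meta-llama/Llama-3.1-8B", local_dir="./sft-tutorial/Llama-3.1-8B")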
🔄 Step 2: Format the dataset into a chat template
We'll use Llama-3.1-8B-Instruct's tokenizer for this tutorial. Let's create a folder first:
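mkdir -p ./sft-tutorial/Llama-3.1-8B-Instruct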
And then download the tokenizer with this Python script:
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./sft-tutorial/Llama-3.1-8B-Instruct")
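Optionally, you can sanity-check the saved tokenizer by applying its chat template to a toy conversation (the messages below are made up for illustration). The output should contain the assistant header and <|eot_id|> delimiters that the next script searches for:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sft-tutorial/Llama-3.1-8B-Instruct")
example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
# The formatted string should include <|start_header_id|>assistant<|end_header_id|>
# and <|eot_id|>, which are used below to delimit the loss-masking spans.
print(tokenizer.apply_chat_template(example, tokenize=False))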
To disable loss computation on pieces of text such as the user/system messages or the chat-template tags, we need to define character spans for this masking. The example below formats the Magpie dataset using Llama-3.1-8B-Instruct's chat template and records the spans where the loss should be masked.
from datasets import load_dataset
from transformers import AutoTokenizer
def apply_chat_template(conversation):
    # Convert the Magpie "from"/"value" message format into the "role"/"content"
    # format expected by the tokenizer's chat template.
    chatml_conv = []
    for conv in conversation:
        if conv["from"] == "human":
            chatml_conv.append({"role": "user", "content": conv["value"]})
        elif conv["from"] == "gpt":
            chatml_conv.append({"role": "assistant", "content": conv["value"]})
    return tokenizer.apply_chat_template(conversation=chatml_conv, tokenize=False)
def get_spans(text: str, start_delimiter: str, end_delimiter: str):
    # Collect inclusive character spans covering everything except the assistant
    # responses; these spans will be masked out of the loss.
    spans = []
    start = 0
    while start < len(text):
        # Mask from the current position up to and including the assistant header.
        header_pos = text.find(start_delimiter, start)
        if header_pos == -1:
            break
        end = header_pos + len(start_delimiter)
        spans.append([start, end - 1])  # character span indices are inclusive
        # Resume masking after the end of the assistant response.
        eot_pos = text.find(end_delimiter, end)
        if eot_pos == -1:
            break
        start = eot_pos + len(end_delimiter)
    return spans
def get_sample_with_spans(sample, start_delimiter, end_delimiter):
    text = apply_chat_template(sample["conversations"])
    spans = get_spans(text, start_delimiter, end_delimiter)
    return {"text": text, "spans": spans}
dataset_name = "Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered"
tokenizer_path = "./sft-tutorial/Llama-3.1-8B-Instruct"
dataset = load_dataset(dataset_name)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
start_delimiter = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
end_delimiter = "<|eot_id|>"
dataset = dataset.map(lambda x: get_sample_with_spans(x, start_delimiter, end_delimiter), num_proc=16)
dataset.save_to_disk("./sft-tutorial/chatml_dataset")
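Before moving on, it can help to spot-check one formatted sample. The optional snippet below (assuming the dataset was saved to the path above and has a train split) prints the text segments that remain unmasked, which should be exactly the assistant replies:

from datasets import load_from_disk

dataset = load_from_disk("./sft-tutorial/chatml_dataset")
sample = dataset["train"][0]
# Everything inside the recorded spans is masked out of the loss; print what remains.
cursor = 0
for start, end in sample["spans"]:
    if sample["text"][cursor:start]:
        print(repr(sample["text"][cursor:start]))
    cursor = end + 1  # span indices are inclusive
print(repr(sample["text"][cursor:]))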
🔍 Step 3: Prepare the dataset
This step is similar to Data preparation, with one additional option that names the dataset field holding the loss-masking spans.
output_path: ./sft-tutorial/tokenized/Llama-3.1-8B
loading_workers: 32
tokenize_workers: 32
saving_workers: 32
dataset:
  path: ./sft-tutorial/chatml_dataset
  split: "train"
  trust_remote_code: true
  field: text
  loss_masking_spans: spans
tokenizer:
  path: ./sft-tutorial/Llama-3.1-8B
splits:
  training: 0.998
  validation: 0.002
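Save this configuration to a file (for example, ./sft-tutorial/prepare-config.yaml; the filename is arbitrary) and run the preparation step as in the Data preparation guide, roughly:

fast-llm prepare gpt_memmap --config ./sft-tutorial/prepare-config.yaml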
⚙️ Step 4: Configure Fast-LLM
It's time to configure the Fast-LLM training config. This is very similar to the Quick Start, with two additional options, truncate_documents and cross_document_attention, which are important for improving the task performance of instruction-tuned models.
training:
  train_iters: 5_000
  logs:
    interval: 1
  evaluations:
    validation:
      iterations: 25
      interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export:
    format: llama
    interval: 1000
batch:
  micro_batch_size: 1
  sequence_length: 4096
  batch_size: 32
  cross_document_attention: no # (1)!
data:
  datasets:
    training:
      type: file
      path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_training.yaml
    validation:
      type: file
      path: ./sft-tutorial/tokenized/Llama-3.1-8B/fast_llm_config_validation.yaml
  truncate_documents: no # (2)!
  sampling:
    use_loss_masking_spans: yes
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 2.0e-05
    minimum: 0.0
    decay_style: cosine
    decay_iterations: 4_900
    warmup_iterations: 100
pretrained:
  format: llama
  path: ./sft-tutorial/Llama-3.1-8B
  model_weights: yes
model:
  base_model:
    transformer:
      use_flash_attention: yes
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 3
  distributed:
    timeout: 3600
    training_dtype: bf16
run:
  experiment_dir: ./sft-tutorial/llama-3.1-8b-instruct-magpie
- Prevents tokens from attending to other documents in a packed sequence.
- Avoids truncating documents to fit into a packed sequence, starting a new sequence instead. Documents longer than the sequence length are skipped altogether.
Launching the training run is similar to Step 7 in the Quick Start guide.
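For example, on a single node with 8 GPUs the launch looks roughly like this (assuming the training configuration above was saved as ./sft-tutorial/train-config.yaml):

torchrun --standalone --nnodes 1 --nproc_per_node 8 --no_python \
    fast-llm train gpt --config ./sft-tutorial/train-config.yaml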