Fine-Tuning a Small LLM from Scratch with LoRA

Small language models are getting remarkably capable. Alibaba’s Qwen3.5 small model series, ranging from 0.6B to 7B parameters, runs on a laptop and punches well above its weight on benchmarks. I have been building a personal GTD system powered by these small, local models. The models understood my tasks, but they weren’t reliable: they’d return prose instead of the structured JSON the pipeline expected, or hallucinate values outside the expected schema. I didn’t need a smarter model, I needed a more obedient one.

Fine-tuning was the obvious fix, and I took it as an excuse to go deep: no frameworks, no abstractions, just PyTorch, LoRA, and a laptop.

What is Supervised Fine-Tuning?

SFT is conceptually simple: you have input-output pairs, and you train the model to maximize the probability of the output given the input. It’s the same language modeling objective used in pretraining, just on your curated dataset.

Given a dataset of pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an instruction and $y_i$ is the target response, the SFT loss is the standard causal language modeling (next-token prediction) objective:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{i=1}^{N} \sum_{t=1}^{|y_i|} \log p_\theta\!\left(y_{i,t} \mid x_i, y_{i,<t}\right)$$

where $\theta$ are the model parameters and $y_{i,<t}$ denotes all tokens of $y_i$ before position $t$.

This is just cross-entropy loss over the vocabulary at each position, teacher-forced. Nothing fancy: it's the same loss used to pretrain the model in the first place, just applied to your specific task data.
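
To make that concrete, here's a standalone sketch of the teacher-forced cross-entropy computation, using toy dimensions and random logits in place of a real model's output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 10, 5

# Pretend logits from a model for one sequence. Teacher forcing means the
# model is conditioned on the true previous tokens, so every position can
# be scored in a single parallel forward pass.
logits = torch.randn(1, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (1, seq_len))

# Next-token prediction: position t's logits predict token t+1,
# so shift logits left and labels right before comparing.
shift_logits = logits[:, :-1, :]
shift_labels = tokens[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1)
)
print(loss.item())  # average negative log-likelihood per predicted token
```

This is exactly what Hugging Face models compute internally when you pass `labels` to the forward call, as the training loop later in this post does.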

The Problem with Full Fine-Tuning

A model like LLaMA-7B has ~7 billion parameters. Fine-tuning all of them requires:

  • Storing the full model in memory (~28GB in fp32)
  • Storing optimizer states (another ~56GB for Adam)
  • Gradient computation across all layers
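
The arithmetic behind those numbers is worth spelling out (a back-of-the-envelope sketch: fp32 weights at 4 bytes per parameter, Adam's two fp32 moment buffers, plus one fp32 gradient per parameter):

```python
params = 7e9  # ~7B parameters

weights_gb = params * 4 / 1e9   # fp32 weights: 4 bytes per parameter
adam_gb = params * 2 * 4 / 1e9  # Adam keeps two fp32 moments per parameter
grads_gb = params * 4 / 1e9     # one fp32 gradient per parameter

print(f"weights:   {weights_gb:.0f} GB")  # 28 GB
print(f"optimizer: {adam_gb:.0f} GB")     # 56 GB
print(f"gradients: {grads_gb:.0f} GB")    # 28 GB
```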

For most tasks, this is overkill. The pretrained weights already encode useful representations. You just need to nudge them slightly.

LoRA: Low-Rank Adaptation

LoRA, introduced by Hu et al. (2021), is based on a key observation: the weight updates during fine-tuning have low intrinsic rank. Instead of updating a full weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA decomposes the update into two small matrices:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

LoRA Architecture: the frozen weight matrix W is bypassed by a low-rank path through matrices A and B

During training:

  • $W$ is frozen (no gradients)
  • Only $A$ and $B$ are trained
  • The forward pass computes $h = Wx + BAx$

The number of trainable parameters drops from $d \times k$ to $r \times (d + k)$. For a typical attention layer with $d = k = 4096$ and $r = 8$, that's $65{,}536$ instead of roughly 16.8M parameters, a 256x reduction.
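
The parameter arithmetic is easy to check in a couple of lines (a sketch for a hypothetical 4096-wide square projection at rank 8):

```python
d = k = 4096  # dimensions of the full projection matrix
r = 8         # LoRA rank

full = d * k        # parameters in the full update matrix
lora = r * (d + k)  # parameters in A (r x k) plus B (d x r)

print(full, lora, full // lora)  # 16777216 65536 256
```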

Scaling Factor

LoRA introduces a scaling factor $\frac{\alpha}{r}$ applied to the low-rank update:

$$h = Wx + \frac{\alpha}{r} BAx$$

where $\alpha$ is a hyperparameter (the implementation below uses $\alpha = 16$ with $r = 8$). This controls the magnitude of the adaptation relative to the pretrained weights. Higher $\alpha$ means the LoRA update has more influence.

Which Layers to Adapt?

The original paper found that adapting the query and value projection matrices ($W_q$, $W_v$) in attention layers works well. Some practitioners also include the key and output projections ($W_k$, $W_o$), or even the MLP layers, but the $W_q$ and $W_v$ projections are the standard starting point.

Step-by-Step Implementation

Let’s build this from scratch in pure PyTorch. No HuggingFace Trainer, no PEFT library. We’ll fine-tune TinyLlama (1.1B parameters) to respond with structured JSON using a synthetic dataset.

1. Generate Training Data

We need instruction/output pairs where the output is always structured JSON:

```python
import json
import random

TOPICS = [
    ("Python", "programming language", ["dynamic typing", "GC", "interpreted"]),
    ("Rust", "programming language", ["ownership", "zero-cost abstractions", "no GC"]),
    ("PostgreSQL", "database", ["ACID", "MVCC", "extensible"]),
    ("Redis", "database", ["in-memory", "key-value", "pub/sub"]),
    ("Docker", "devops tool", ["containers", "isolation", "portability"]),
    # ... more topics
]


def generate_dataset(n=500, seed=42):
    random.seed(seed)
    examples = []
    for _ in range(n):
        name, category, features = random.choice(TOPICS)
        feat = random.sample(features, k=2)
        instruction = f"Describe {name} in structured format."
        output = json.dumps({"name": name, "category": category, "features": feat})
        examples.append({"instruction": instruction, "output": output})
    return examples
```

Each example looks like:

```json
{
  "instruction": "Describe Rust in structured format.",
  "output": "{\"name\": \"Rust\", \"category\": \"programming language\", \"features\": [\"ownership\", \"no GC\"]}"
}
```

The base model would respond with free-form text. After SFT, it should produce JSON.

2. Format and Tokenize

We concatenate instruction and output into a single sequence with a template:

```python
def format_and_tokenize(examples, tokenizer, max_length=512):
    texts = [
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Response:\n{ex['output']}{tokenizer.eos_token}"
        for ex in examples
    ]
    return tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```

The model learns to predict every token in this sequence, but the signal that matters is the response portion. The instruction tokens provide context.
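
A common refinement, not used in the minimal setup here, is to mask the instruction tokens out of the loss entirely by setting their label positions to -100, the ignore index that `F.cross_entropy` (and Hugging Face models' built-in loss) skips. A sketch, assuming the prompt length in tokens is known (`mask_instruction_labels` is an illustrative helper, not part of any library):

```python
import torch

IGNORE_INDEX = -100  # label positions with this value are excluded from the loss


def mask_instruction_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking out the instruction prefix."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX
    return labels


# Hypothetical batch: one 8-token sequence whose first 5 tokens are the prompt.
input_ids = torch.arange(8).unsqueeze(0)
labels = mask_instruction_labels(input_ids, prompt_len=5)
print(labels)  # tensor([[-100, -100, -100, -100, -100, 5, 6, 7]])
```

You would then pass these `labels` instead of `labels=input_ids` in the training loop.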

3. Implement LoRA from Scratch

This is the core of it. A LoRA layer wraps a frozen linear layer with two small trainable matrices:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, original: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.original = original
        self.original.weight.requires_grad_(False)
        if original.bias is not None:
            original.bias.requires_grad_(False)

        d_out, d_in = original.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_out))
        self.scale = alpha / r

    def forward(self, x):
        base = self.original(x)
        lora = (x @ self.A @ self.B) * self.scale
        return base + lora
```

That’s it. A is initialized with small random values; B starts at zero, so the LoRA path initially contributes nothing. The original weights are frozen.
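
A quick standalone sanity check of that initialization scheme (toy dimensions, mirroring the class rather than reusing it): with B at zero, the low-rank path is exactly zero, so the wrapped layer reproduces the frozen layer's output and training starts from the pretrained behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(16, 16)

# Same init scheme as LoRALinear: small random A, zero B, scale = alpha / r.
A = nn.Parameter(torch.randn(16, 4) * 0.01)
B = nn.Parameter(torch.zeros(4, 16))
scale = 16 / 4

x = torch.randn(2, 16)
out = base(x) + (x @ A @ B) * scale

# At initialization the LoRA contribution is exactly zero:
print(torch.equal(out, base(x)))  # True
```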

4. Apply LoRA to the Model

We replace the q_proj and v_proj layers in every attention block:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float32
)

# Freeze everything
for param in model.parameters():
    param.requires_grad_(False)

# Inject LoRA into attention layers
for layer in model.model.layers:
    layer.self_attn.q_proj = LoRALinear(layer.self_attn.q_proj, r=8, alpha=16)
    layer.self_attn.v_proj = LoRALinear(layer.self_attn.v_proj, r=8, alpha=16)

# Count trainable params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.3f}%)")
# Trainable: 1,126,400 / 1,101,174,784 (0.102%)
```

Only 0.1% of parameters are trainable. The adapter weights will be ~4MB.
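
To persist just the adapter rather than the full model, one approach (a sketch using a toy module in place of the full model; the same name-based filtering applies to the real one) is to keep only the trainable entries of the state dict:

```python
import torch
import torch.nn as nn


class Toy(nn.Module):
    """Toy stand-in for the model: one frozen base layer plus LoRA factors."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(8, 2) * 0.01)
        self.B = nn.Parameter(torch.zeros(2, 8))


model = Toy()

# Keep only the trainable (LoRA) entries of the state dict.
trainable_names = {n for n, p in model.named_parameters() if p.requires_grad}
adapter_state = {k: v for k, v in model.state_dict().items() if k in trainable_names}
print(sorted(adapter_state))  # ['A', 'B']

torch.save(adapter_state, "adapter.pt")
# Later: model.load_state_dict(torch.load("adapter.pt"), strict=False)
```

Loading with `strict=False` lets the tiny adapter file overlay a freshly loaded base model, which is what makes swapping adapters cheap.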

5. Training Loop

A standard PyTorch training loop. Nothing hidden behind abstractions:

```python
from torch.utils.data import DataLoader, TensorDataset

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize and split 80/20
data = generate_dataset(n=500)
split = int(0.8 * len(data))
train_tokens = format_and_tokenize(data[:split], tokenizer)
val_tokens = format_and_tokenize(data[split:], tokenizer)

train_loader = DataLoader(
    TensorDataset(train_tokens["input_ids"], train_tokens["attention_mask"]),
    batch_size=8,
    shuffle=True,
)
val_loader = DataLoader(
    TensorDataset(val_tokens["input_ids"], val_tokens["attention_mask"]),
    batch_size=8,
)

# Only optimize LoRA parameters
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-4
)

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

for epoch in range(3):
    model.train()
    total_loss = 0
    for input_ids, attention_mask in train_loader:
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
        outputs = model(
            input_ids=input_ids, attention_mask=attention_mask, labels=input_ids
        )
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()
        total_loss += outputs.loss.item()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for input_ids, attention_mask in val_loader:
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
            outputs = model(
                input_ids=input_ids, attention_mask=attention_mask, labels=input_ids
            )
            val_loss += outputs.loss.item()

    print(
        f"Epoch {epoch+1}/3, Train: {total_loss/len(train_loader):.4f}, "
        f"Val: {val_loss/len(val_loader):.4f}"
    )
```

On an M-series Mac (MPS), you should see the train loss drop sharply in the first epoch, with validation loss tracking closely.

6. Evaluate

Generate from both the base model and our LoRA-injected model:

```python
def generate(model, tokenizer, prompt, max_new_tokens=128):
    model.eval()
    inputs = tokenizer(
        f"### Instruction:\n{prompt}\n\n### Response:\n", return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


print(generate(model, tokenizer, "Describe Python in structured format."))
```

Results

Here are the actual results from training TinyLlama 1.1B on 500 synthetic examples (400 train / 100 validation) for 3 epochs.

Training Loss

Training and validation loss over 150 steps

Training completed on Apple MPS (M-series Mac). The train loss drops sharply in the first epoch as the model learns the JSON structure, then steadily decreases through epochs 2 and 3. Validation loss tracks closely throughout, ending at 0.05 with no sign of overfitting.

Base Model vs. Finetuned

Side-by-side comparison: base model produces Go code while the fine-tuned model outputs valid JSON for an unseen topic

The base model interprets “structured format” literally and produces Go code. After SFT, the same prompt on an unseen topic yields clean, parseable JSON.

A note on correctness: Look closely at the fine-tuned output. The format is right, but the content isn’t. Go is listed with features like “dynamic typing” and “interpreted,” which are wrong. The model learned how to respond (valid JSON) but is confabulating facts for topics outside the training set. With only 500 synthetic examples covering a handful of technologies, this is expected: the model has no signal for what Go’s features actually are, so it fills in plausible-sounding values. This highlights a broader risk with SFT: fine-tuning too aggressively can also degrade the model’s existing knowledge (sometimes called catastrophic forgetting). In practice, you always want a regression suite, a broad evaluation that checks the model’s general capabilities haven’t degraded after fine-tuning.
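
A minimal starting point for such an evaluation is simply measuring what fraction of generations parse as JSON with the expected schema (a sketch; `samples` here are hypothetical model outputs, and `json_parse_rate` is an illustrative helper):

```python
import json

EXPECTED_KEYS = {"name", "category", "features"}  # schema from the training data


def json_parse_rate(outputs):
    """Fraction of outputs that are valid JSON objects with the expected keys."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # prose or malformed JSON
        if isinstance(obj, dict) and set(obj) == EXPECTED_KEYS:
            ok += 1
    return ok / len(outputs)


# Hypothetical outputs: one valid, one prose, one with the wrong schema.
samples = [
    '{"name": "Go", "category": "programming language", "features": ["compiled"]}',
    "Go is a statically typed language.",
    '{"name": "Go"}',
]
print(json_parse_rate(samples))  # 0.3333333333333333
```

Note that this only measures format compliance, exactly the failure mode the caveat above describes: a factually wrong but well-formed answer still passes, so a real regression suite needs content checks too.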

Key Takeaways

  • SFT is just next-token prediction on curated data, the same objective as pretraining
  • LoRA makes this practical by training ~0.1% of parameters via low-rank matrix decomposition
  • A few hundred examples and 3 epochs is enough to teach a clear pattern
  • The adapter weights are tiny (~4MB) and can be swapped without reloading the base model

References

  1. Hu, E. J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. arXiv:2106.09685
  2. Dettmers, T., et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. arXiv:2305.14314
  3. The full code for this post is available at github.com/mrrostam/blog-code/sft-lora

Appendix: Frameworks and Tools

Everything in this post was implemented in pure PyTorch to make the mechanics visible. If you’re moving beyond experimentation, these frameworks handle the boilerplate and scale better:

  • TRL (Transformer Reinforcement Learning): Hugging Face’s library for SFT, DPO, PPO, and RLHF. SFTTrainer handles formatting, packing, and LoRA integration in a few lines. The most popular choice for fine-tuning workflows.
  • Axolotl: YAML-config-driven fine-tuning. Supports LoRA, QLoRA, full fine-tuning, multi-GPU, Flash Attention, and many dataset formats out of the box. Good for running experiments without writing training code.
  • Unsloth: Optimized for speed. 2x faster training and 70% less memory through custom Triton kernels. Drop-in replacement for Hugging Face trainers, supports Llama, Qwen, Mistral, Gemma, and more.
  • LLaMA-Factory: Web UI and CLI for fine-tuning 100+ LLMs. Supports SFT, RLHF, DPO, and various quantization methods. Lowest barrier to entry.
  • PEFT (Parameter-Efficient Fine-Tuning): Hugging Face’s library for LoRA, QLoRA, prefix tuning, prompt tuning, and other adapter methods. Works with any Hugging Face model.
  • torchtune: PyTorch-native fine-tuning library from the PyTorch team. Clean, modular, and well-documented. Good if you prefer staying close to raw PyTorch without the Hugging Face ecosystem.