Merging LoRA Adapters and Serving Locally

In the SFT and DPO posts, I trained LoRA adapters using pure PyTorch. The adapters are tiny (~4MB), but at inference time you still need to load the base model, inject the LoRA wrappers, and load the adapter weights. What if you just want a single, standalone model you can run anywhere?

Merging folds the adapter back into the base weights permanently. The result is a standard model file with no adapter machinery required.

Why Merge?

During training, LoRA keeps the base weights frozen and stores the adaptation as separate low-rank matrices. The forward pass computes:

y = x @ W^T + scale * (x @ A @ B)

Merging does the matrix addition once and throws away the adapter:

W_merged = W + scale * (B^T @ A^T)

Merging folds the LoRA matrices into the base weights, producing a single model

After merging, inference is just a standard forward pass through the merged weights. No LoRALinear wrappers, no adapter loading.
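The equivalence is easy to check numerically on random matrices. A sketch with made-up dimensions, following the x @ A @ B convention from the SFT post:

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, scale = 8, 6, 2, 0.5
W = torch.randn(d_out, d_in)   # nn.Linear stores weight as (out_features, in_features)
A = torch.randn(d_in, r)
B = torch.randn(r, d_out)
x = torch.randn(3, d_in)

# Adapter path: base output plus scaled low-rank update
lora_out = x @ W.T + scale * (x @ A @ B)

# Merged path: fold the update into W once
W_merged = W + scale * (B.T @ A.T)
merged_out = x @ W_merged.T

assert torch.allclose(lora_out, merged_out, atol=1e-5)
```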

Merging in Pure PyTorch

Since we built our own LoRALinear in the SFT post, merging is straightforward. We walk the model, find every LoRALinear module, compute the merged weight, and replace it with a plain nn.Linear:

import torch
from sft_lora.sft.train import LoRALinear, inject_lora

def merge_lora(model):
    for name, module in list(model.named_modules()):
        if isinstance(module, LoRALinear):
            # W_merged = W + scale * B^T @ A^T
            merged_weight = (
                module.original.weight.data
                + module.scale * (module.B.data.T @ module.A.data.T)
            )
            new_linear = torch.nn.Linear(
                module.original.in_features,
                module.original.out_features,
                bias=module.original.bias is not None,
            )
            new_linear.weight.data = merged_weight
            if module.original.bias is not None:
                new_linear.bias.data = module.original.bias.data

            # Replace in parent module
            parts = name.split(".")
            parent = model
            for p in parts[:-1]:
                parent = getattr(parent, p)
            setattr(parent, parts[-1], new_linear)
    return model

The shape math: our LoRALinear.forward computes (x @ A @ B) * scale, which is equivalent to adding scale * B^T @ A^T to the original weight matrix (since nn.Linear stores weights transposed).
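The parent walk at the end is the only subtle part: named_modules() returns dotted paths, so we resolve every segment except the last and then setattr the replacement onto the parent. A toy nested model (the Block/Toy names are made up for illustration) shows the pattern:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(4, 4)

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = Block()

model = Toy()
name = "attn.q_proj"  # dotted path, as produced by model.named_modules()

# Resolve every segment except the last to reach the parent module...
parts = name.split(".")
parent = model
for p in parts[:-1]:
    parent = getattr(parent, p)
# ...then swap in the replacement
setattr(parent, parts[-1], nn.Linear(4, 4, bias=False))

assert model.attn.q_proj.bias is None  # the replacement took effect
```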

Using It

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype=torch.float32
)
inject_lora(model)

# Load trained LoRA weights
state = torch.load("output/sft/lora_weights.pt", map_location="cpu", weights_only=True)
for name, param in model.named_parameters():
    if name in state:
        param.data.copy_(state[name])

# Merge and save (tokenizer too, so the output directory is self-contained)
model = merge_lora(model)
model.save_pretrained("output/merged")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.save_pretrained("output/merged")
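One thing the loading loop above silently tolerates is checkpoint keys that no longer match any parameter (easy to hit after renaming modules). A defensive variant, shown here on toy stand-ins rather than the real model:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a "model" and a saved state dict containing one stale key
model = nn.Sequential(nn.Linear(4, 4))
state = {"0.weight": torch.zeros(4, 4), "0.stale_key": torch.zeros(1)}

loaded = set()
for name, param in model.named_parameters():
    if name in state:
        param.data.copy_(state[name])
        loaded.add(name)

missing = set(state) - loaded
assert missing == {"0.stale_key"}  # real code would warn or raise here
```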

Verifying the Merge

A quick sanity check: the merged model should produce identical output to the LoRA model:

prompt = "### Instruction:\nDescribe Go in structured format.\n\n### Response:\n"
# ... generate from both models ...
assert torch.equal(lora_output, merged_output) # passes

Both produce:

{"name": "Go", "category": "programming language", "features": ["dynamic typing", "interpreter"]}

Size Comparison

Artifact                            Size
LoRA adapter                        4.3 MB
Merged model (fp32)                 4.1 GB
Base model (fp16, from HF cache)    2.1 GB

The adapter is roughly 500x smaller than the fp16 base model. Once merged, the model has the same shape as the base (same number of parameters), just with slightly different values. The merged model is 4.1GB only because we saved in float32; converting to float16 or quantizing brings it back down.
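The halving comes straight from bytes per parameter; calling .half() on the merged model before save_pretrained does the cast. A toy layer makes the arithmetic concrete:

```python
import torch.nn as nn

layer = nn.Linear(1024, 1024)
fp32_bytes = sum(p.element_size() * p.nelement() for p in layer.parameters())
layer.half()  # cast all parameters to float16 in place
fp16_bytes = sum(p.element_size() * p.nelement() for p in layer.parameters())
assert fp16_bytes * 2 == fp32_bytes  # 2 bytes per parameter instead of 4
```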

Serving Locally

Once merged, you can serve the model with a simple interactive loop:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("output/merged", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("output/merged")

while True:
    user_input = input(">>> ")
    prompt = f"### Instruction:\n{user_input}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Converting to GGUF for llama.cpp / Ollama

For faster local inference, you can quantize the merged model to GGUF format:

# Convert and quantize to 8-bit
python llama.cpp/convert_hf_to_gguf.py ./output/merged \
    --outfile model.gguf --outtype q8_0

# Run with llama.cpp
./llama-cli -m model.gguf \
    -p "### Instruction:\nDescribe Python in structured format.\n\n### Response:\n"

# Or create an Ollama model
echo 'FROM ./model.gguf' > Modelfile
ollama create my-json-model -f Modelfile
ollama run my-json-model "Describe Python in structured format."

Quantization to Q8 roughly halves the model size (~1GB for TinyLlama) with minimal quality loss. Q4 gets you to ~600MB with more noticeable degradation. I will cover quantization in more detail in a future post.
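These sizes follow from bytes-per-weight arithmetic. GGUF block quantization stores a scale per block of 32 weights, so q8_0 costs about 8.5 bits per weight and q4_0 about 4.5 (rough per-weight costs; totals ignore metadata and non-quantized tensors):

```python
params = 1.1e9  # TinyLlama parameter count

# Approximate bytes per weight for each format
for name, bytes_per_weight in [("fp16", 2.0), ("q8_0", 8.5 / 8), ("q4_0", 4.5 / 8)]:
    print(f"{name}: {params * bytes_per_weight / 1e9:.1f} GB")
# fp16: 2.2 GB, q8_0: 1.2 GB, q4_0: 0.6 GB
```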

When to Merge vs. Keep Separate

Keep adapters separate when:

  • You have multiple adapters for different tasks and want to swap them
  • You are still iterating on training
  • Storage matters (4MB vs 4GB)

Merge when:

  • You want a single deployable model
  • You need compatibility with runtimes that do not support custom LoRA wrappers (llama.cpp, vLLM, TGI)
  • You are done training and want to ship

References

  1. Hu, E. J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. arXiv:2106.09685
  2. llama.cpp: C/C++ inference engine with GGUF conversion and quantization support.
  3. Ollama: Local LLM runner that uses GGUF models.
  4. The full code for this post is available at github.com/mrrostam/blog-code/merge