Quantizing Your Own Fine-Tuned Model with llama.cpp

You can find almost any popular model pre-quantized on Hugging Face in every GGUF variant imaginable. But the moment you fine-tune your own model, you are on your own. Nobody has uploaded a Q4_K_M of your custom checkpoint. If you want to run it through Ollama or llama.cpp and actually use your GPU efficiently, you need to quantize it yourself.

This post walks through the full pipeline: taking the merged safetensors model from the merging post, converting it to GGUF, quantizing it at different levels, and measuring what you lose. If you want to understand the theory behind these quantization types, see the quantization post.

Starting Point

I am picking up where the merging post left off. We have a merged TinyLlama model in output/merged/ as safetensors, fine-tuned with LoRA to output structured JSON. The full code is at github.com/mrrostam/blog-code/quantize.

ls output/merged/
# config.json generation_config.json model.safetensors
# tokenizer.json tokenizer_config.json chat_template.jinja

du -sh output/merged/model.safetensors
# 2.1G (float32 weights for TinyLlama 1.1B)

Setting Up llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

On Apple Silicon, this builds with Metal support by default. On Linux with CUDA:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

Step 1: Convert to GGUF (FP16)

The first step converts the Hugging Face safetensors model to GGUF. This does not quantize anything yet; it only changes the container format and casts the weights to float16:

python llama.cpp/convert_hf_to_gguf.py output/merged/ \
  --outfile output/gguf/model-f16.gguf \
  --outtype f16

This gives you a baseline GGUF file at full float16 precision.
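Before quantizing, a cheap sanity check (my own addition, not part of the original pipeline) is to look at the file header: every valid GGUF file begins with the 4-byte ASCII magic "GGUF", so you can confirm the conversion produced a real GGUF container without loading the model:

```shell
# A valid GGUF file starts with the 4-byte ASCII magic "GGUF".
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "ok: $1 looks like GGUF"
  else
    echo "not GGUF: $1"
  fi
}

check_gguf output/gguf/model-f16.gguf
```

This only validates the magic bytes, not the tensor data, but it catches the common failure mode of a truncated or half-written output file.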

Step 2: Quantize

Now quantize the FP16 GGUF to different levels using llama-quantize:

for quant in Q8_0 Q6_K Q5_K_M Q4_K_M Q4_0 Q3_K_M Q2_K; do
  ./llama.cpp/build/bin/llama-quantize \
    output/gguf/model-f16.gguf \
    "output/gguf/model-${quant}.gguf" \
    "$quant"
done

Each run takes a few seconds for a 1.1B model. Larger models take proportionally longer, but it is still fast compared to GPTQ, which needs a calibration dataset and a GPU.

Step 3: Compare Sizes

ls -lhS output/gguf/
File               Quant    Size
model-f16.gguf     FP16     2.1 GB
model-Q8_0.gguf    Q8_0     1.1 GB
model-Q6_K.gguf    Q6_K     893 MB
model-Q5_K_M.gguf  Q5_K_M   782 MB
model-Q4_K_M.gguf  Q4_K_M   668 MB
model-Q4_0.gguf    Q4_0     616 MB
model-Q3_K_M.gguf  Q3_K_M   535 MB
model-Q2_K.gguf    Q2_K     432 MB

Q4_K_M is about 3x smaller than FP16. Q2_K is nearly 5x smaller.
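Another way to read the size table is effective bits per weight: file size in bits divided by parameter count. A back-of-envelope calculation for Q4_K_M, using the ~1.1B parameter count from earlier (the exact count is an assumption here):

```shell
# Effective bits per weight ≈ file size in bits / parameter count.
# 668 MB Q4_K_M file, ~1.1B parameters:
awk 'BEGIN { printf "%.1f bits/weight\n", 668e6 * 8 / 1.1e9 }'
# → 4.9 bits/weight
```

The result is noticeably above the nominal 4 bits because K-quants store per-block scale factors and Q4_K_M keeps some sensitive tensors at higher precision.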

Step 4: Measure Speed

for f in output/gguf/model-f16.gguf output/gguf/model-Q4_K_M.gguf; do
  echo "=== $(basename "$f") ==="
  ./llama.cpp/build/bin/llama-bench -m "$f" -p 512 -n 128 -ngl 99
done

The -ngl 99 flag offloads all layers to GPU (Metal on Mac, CUDA on Linux). On my M2 MacBook:

Quant    Prompt eval (tok/s)    Generation (tok/s)
FP16     ~350                   ~45
Q4_K_M   ~800                   ~85

Q4_K_M is roughly 2x faster for both prompt processing and generation: at these batch sizes inference is memory-bandwidth-bound, and a smaller file means fewer bytes to stream through memory per token.
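You can sanity-check these numbers with a simple bandwidth model: at batch size 1, every generated token has to read the entire weight file from memory, so generation speed is bounded above by bandwidth divided by model size. Assuming roughly 100 GB/s for an M2 (my assumption; the exact figure varies by chip):

```shell
# Upper bound on generation speed: memory bandwidth / bytes read per token.
awk 'BEGIN {
  bw = 100e9                                         # assumed bandwidth, bytes/s
  printf "FP16 bound:   %.0f tok/s\n", bw / 2.1e9    # 2.1 GB weight file
  printf "Q4_K_M bound: %.0f tok/s\n", bw / 668e6    # 668 MB weight file
}'
# → FP16 bound:   48 tok/s
# → Q4_K_M bound: 150 tok/s
```

The measured FP16 figure (~45 tok/s) sits right at its bound, while Q4_K_M (~85 tok/s) lands well under its ~150 tok/s ceiling, presumably because dequantization adds compute overhead.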

Serving with Ollama

Once you have a GGUF file, getting it into Ollama is two commands:

echo 'FROM ./output/gguf/model-Q4_K_M.gguf' > Modelfile
ollama create my-json-model -f Modelfile
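The one-line Modelfile works, but Ollama's Modelfile format also supports PARAMETER and SYSTEM directives, which is handy for a structured-output model. A sketch, with illustrative values that I have not tuned:

```shell
# Write a Modelfile that also pins a sampling parameter and a system prompt.
cat > Modelfile <<'EOF'
FROM ./output/gguf/model-Q4_K_M.gguf
PARAMETER temperature 0.2
SYSTEM You are a model that answers in structured JSON.
EOF

head -n 1 Modelfile
# → FROM ./output/gguf/model-Q4_K_M.gguf
```

A low temperature is a reasonable default when you want deterministic, schema-shaped output; re-run `ollama create` after editing the Modelfile to pick up the changes.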

Then use it:

ollama run my-json-model "Describe Go in structured format."

Or through the API:

curl http://localhost:11434/api/generate -d '{
  "model": "my-json-model",
  "prompt": "Describe Kafka in structured format.",
  "stream": false
}'

The Pipeline

To summarize, going from a fine-tuned model to fast local inference:

  1. Merge LoRA adapter into base model (covered in the merging post)
  2. Convert safetensors to GGUF with convert_hf_to_gguf.py
  3. Quantize with llama-quantize (pick Q4_K_M as default)
  4. Serve with Ollama or llama.cpp

The entire process takes under a minute for a 1B model. For a 7B model, expect a few minutes. No GPU needed for the quantization step itself (unlike GPTQ or AWQ).

References

  1. llama.cpp: C/C++ inference engine with GGUF conversion and quantization.
  2. Ollama: Local LLM runner built on llama.cpp.
  3. GGUF format specification: File format documentation.