Quantizing Your Own Fine-Tuned Model with llama.cpp
You can find almost any popular model pre-quantized on Hugging Face in every GGUF variant imaginable. But the moment you fine-tune your own model, you are on your own. Nobody has uploaded a Q4_K_M of your custom checkpoint. If you want to run it through Ollama or llama.cpp and actually use your GPU efficiently, you need to quantize it yourself.
This post walks through the full pipeline: taking the merged safetensors model from the merging post, converting it to GGUF, quantizing it at different levels, and measuring what you lose. If you want to understand the theory behind these quantization types, see the quantization post.
Starting Point
I am picking up where the merging post left off. We have a merged TinyLlama model in output/merged/ as safetensors, fine-tuned with LoRA to output structured JSON. The full code is at github.com/mrrostam/blog-code/quantize.
```sh
ls output/merged/
```
Setting Up llama.cpp
```sh
git clone https://github.com/ggml-org/llama.cpp
cmake -B llama.cpp/build -S llama.cpp
cmake --build llama.cpp/build --config Release -j
```
On Apple Silicon, this builds with Metal support by default. On Linux with CUDA:
```sh
cmake -B llama.cpp/build -S llama.cpp -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j
```
Step 1: Convert to GGUF (FP16)
The first step converts from Hugging Face safetensors format to GGUF. This does not quantize anything yet; it just changes the container format and stores the weights as float16:
```sh
mkdir -p output/gguf
python llama.cpp/convert_hf_to_gguf.py output/merged/ \
  --outfile output/gguf/model-f16.gguf \
  --outtype f16
```
This gives you a baseline GGUF file at full float16 precision.
Step 2: Quantize
Now quantize the FP16 GGUF to different levels using llama-quantize:
```sh
for quant in Q8_0 Q6_K Q5_K_M Q4_K_M Q4_0 Q3_K_M Q2_K; do
  ./llama.cpp/build/bin/llama-quantize \
    output/gguf/model-f16.gguf \
    output/gguf/model-${quant}.gguf ${quant}
done
```
Each run takes a few seconds for a 1.1B model. Larger models take proportionally longer, but the process is still fast compared to GPTQ, which needs a calibration dataset and a GPU.
Step 3: Compare Sizes
```sh
ls -lhS output/gguf/
```
| File | Quant | Size |
|---|---|---|
| model-f16.gguf | FP16 | 2.1 GB |
| model-Q8_0.gguf | Q8_0 | 1.1 GB |
| model-Q6_K.gguf | Q6_K | 893 MB |
| model-Q5_K_M.gguf | Q5_K_M | 782 MB |
| model-Q4_K_M.gguf | Q4_K_M | 668 MB |
| model-Q4_0.gguf | Q4_0 | 616 MB |
| model-Q3_K_M.gguf | Q3_K_M | 535 MB |
| model-Q2_K.gguf | Q2_K | 432 MB |
Q4_K_M is about 3x smaller than FP16. Q2_K is nearly 5x smaller.
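The sizes above map directly to effective bits per weight. A quick back-of-the-envelope check, assuming TinyLlama's roughly 1.1B parameters (the sizes come from the table, so MB vs MiB rounding blurs the numbers slightly):

```python
# Effective bits per weight = file size in bits / parameter count.
# PARAMS is an approximation for TinyLlama 1.1B.
PARAMS = 1.1e9

sizes_mb = {
    "FP16": 2100, "Q8_0": 1100, "Q6_K": 893, "Q5_K_M": 782,
    "Q4_K_M": 668, "Q4_0": 616, "Q3_K_M": 535, "Q2_K": 432,
}

for quant, mb in sizes_mb.items():
    bits = mb * 1e6 * 8 / PARAMS
    print(f"{quant:8s} {bits:.1f} bits/weight")
```

Q4_K_M lands near 4.9 bits per weight, which matches its nominal ~4.85: "4-bit" quants carry extra per-block scale metadata, which is why none of them hit exactly 4 bits.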
Step 4: Measure Speed
```sh
for f in output/gguf/model-f16.gguf output/gguf/model-Q4_K_M.gguf; do
  ./llama.cpp/build/bin/llama-bench -m "$f" -ngl 99
done
```
The -ngl 99 flag offloads all layers to GPU (Metal on Mac, CUDA on Linux). On my M2 MacBook:
| Quant | Prompt eval (tok/s) | Generation (tok/s) |
|---|---|---|
| FP16 | ~350 | ~45 |
| Q4_K_M | ~800 | ~85 |
Q4_K_M is roughly 2x faster for both prompt processing and generation: token generation is memory-bandwidth bound, so moving fewer bytes per token translates almost directly into speed.
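You can sanity-check these numbers with a simple roofline estimate: each generated token reads (roughly) every weight once, so tokens/sec is capped at memory bandwidth divided by model size. A sketch, assuming ~100 GB/s for a base M2 (that bandwidth figure is an assumption, not a measurement from this post):

```python
# Roofline ceiling for token generation:
# tok/s <= memory bandwidth / bytes read per token (~ model size).
BANDWIDTH_GB_S = 100  # assumed base M2 memory bandwidth

for name, size_gb in [("FP16", 2.1), ("Q4_K_M", 0.668)]:
    ceiling = BANDWIDTH_GB_S / size_gb
    print(f"{name}: <= {ceiling:.0f} tok/s")
```

FP16's measured ~45 tok/s sits right at its ~48 tok/s ceiling, confirming it is bandwidth-bound; Q4_K_M's ~85 falls short of its ~150 ceiling because dequantizing weights on the fly adds compute overhead.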
Serving with Ollama
Once you have a GGUF file, getting it into Ollama is two commands:
```sh
echo 'FROM ./output/gguf/model-Q4_K_M.gguf' > Modelfile
ollama create my-json-model -f Modelfile
```
Then use it:
```sh
ollama run my-json-model "Describe Go in structured format."
```
Or through the API:
```sh
curl http://localhost:11434/api/generate -d '{
  "model": "my-json-model",
  "prompt": "Describe Go in structured format.",
  "stream": false
}'
```
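Since this model was fine-tuned to emit structured JSON, the cheapest quality gate per quantization level is simply: does the output still parse? A minimal sketch of that check (the sample strings below are illustrative, not real model output):

```python
import json

def valid_json(text: str) -> bool:
    """Return True if the model's raw output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Illustrative outputs: a well-formed response and a truncated one,
# the kind of degradation you might see at aggressive quant levels.
good = '{"language": "Go", "typing": "static", "gc": true}'
bad = '{"language": "Go", "typing": "stat'

print(valid_json(good), valid_json(bad))  # True False
```

In practice you would loop over a handful of prompts against each quantized model (via `ollama run` or the API above) and count parse failures per quant level.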
The Pipeline
To summarize, going from a fine-tuned model to fast local inference:
- Merge the LoRA adapter into the base model (covered in the merging post)
- Convert safetensors to GGUF with `convert_hf_to_gguf.py`
- Quantize with `llama-quantize` (Q4_K_M is a sensible default)
- Serve with Ollama or llama.cpp
The entire process takes under a minute for a 1B model. For a 7B model, expect a few minutes. No GPU needed for the quantization step itself (unlike GPTQ or AWQ).
References
- llama.cpp: C/C++ inference engine with GGUF conversion and quantization.
- Ollama: Local LLM runner built on llama.cpp.
- GGUF format specification: File format documentation.