Quantizing Your Own Fine-Tuned Model with llama.cpp
You can find almost any popular model pre-quantized on Hugging Face in every GGUF variant imaginable. But the moment you fine-tune your own model, you are on your own. Nobody has uploaded a Q4_K_M of your custom checkpoint. If you want to run it through Ollama or llama.cpp and actually use your GPU efficiently, you need to quantize it yourself.
This post walks through the full pipeline: taking the merged safetensors model from the merging post, converting it to GGUF, quantizing it at different levels, and measuring what you lose. If you want to understand the theory behind these quantization types, see the quantization post.
Starting Point
I am picking up where the merging post left off. We have a merged TinyLlama model in output/merged/ as safetensors, fine-tuned with LoRA to output structured JSON. The full code is at github.com/mrrostam/blog-code/quantize.
```sh
ls output/merged/
```
Setting Up llama.cpp
```sh
git clone https://github.com/ggml-org/llama.cpp
cmake -B llama.cpp/build -S llama.cpp
cmake --build llama.cpp/build --config Release -j
```
On Apple Silicon, this builds with Metal support by default. On Linux with CUDA:
```sh
cmake -B llama.cpp/build -S llama.cpp -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j
```
Step 1: Convert to GGUF (FP16)
The first step converts from Hugging Face safetensors format to GGUF. This does not quantize anything yet; it just changes the container format and stores the weights as float16:
```sh
mkdir -p output/gguf
python llama.cpp/convert_hf_to_gguf.py output/merged/ \
  --outfile output/gguf/model-f16.gguf \
  --outtype f16
```
This gives you a baseline GGUF file at full float16 precision.
Step 2: Quantize
Now quantize the FP16 GGUF to different levels using llama-quantize:
```sh
for quant in Q8_0 Q6_K Q5_K_M Q4_K_M Q4_0 Q3_K_M Q2_K; do
  ./llama.cpp/build/bin/llama-quantize \
    output/gguf/model-f16.gguf \
    output/gguf/model-${quant}.gguf ${quant}
done
```
Each run takes a few seconds for a 1.1B model. Larger models take proportionally longer, but the process is still fast compared to GPTQ, which needs a calibration dataset and a GPU.
Step 3: Compare Sizes
```sh
ls -lhS output/gguf/
```
| File | Quant | Size |
|---|---|---|
| model-f16.gguf | FP16 | 2.1 GB |
| model-Q8_0.gguf | Q8_0 | 1.1 GB |
| model-Q6_K.gguf | Q6_K | 893 MB |
| model-Q5_K_M.gguf | Q5_K_M | 782 MB |
| model-Q4_K_M.gguf | Q4_K_M | 668 MB |
| model-Q4_0.gguf | Q4_0 | 616 MB |
| model-Q3_K_M.gguf | Q3_K_M | 535 MB |
| model-Q2_K.gguf | Q2_K | 432 MB |
Q4_K_M is about 3x smaller than FP16. Q2_K is nearly 5x smaller.
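The sizes above map directly to effective bits per weight. A quick back-of-the-envelope check, assuming TinyLlama's roughly 1.1B parameters (the sizes come from the table, so MB vs MiB rounding blurs the numbers slightly):

```python
# Effective bits per weight = file size in bits / parameter count.
# PARAMS is an approximation for TinyLlama 1.1B.
PARAMS = 1.1e9

sizes_mb = {
    "FP16": 2100, "Q8_0": 1100, "Q6_K": 893, "Q5_K_M": 782,
    "Q4_K_M": 668, "Q4_0": 616, "Q3_K_M": 535, "Q2_K": 432,
}

for quant, mb in sizes_mb.items():
    bits = mb * 1e6 * 8 / PARAMS
    print(f"{quant:8s} {bits:.1f} bits/weight")
```

Q4_K_M lands near 4.9 bits per weight, which matches its nominal ~4.85: "4-bit" quants carry extra per-block scale metadata, which is why none of them hit exactly 4 bits.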
Step 4: Measure Speed
```sh
for f in output/gguf/model-f16.gguf output/gguf/model-Q4_K_M.gguf; do
  ./llama.cpp/build/bin/llama-bench -m "$f" -ngl 99
done
```
The -ngl 99 flag offloads all layers to GPU (Metal on Mac, CUDA on Linux). On my M2 MacBook:
| Quant | Prompt eval (tok/s) | Generation (tok/s) |
|---|---|---|
| FP16 | ~350 | ~45 |
| Q4_K_M | ~800 | ~85 |
Q4_K_M is roughly 2x faster for both prompt processing and generation: token generation is memory-bandwidth bound, so moving fewer bytes per token translates almost directly into speed.
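You can sanity-check these numbers with a simple roofline estimate: each generated token reads (roughly) every weight once, so tokens/sec is capped at memory bandwidth divided by model size. A sketch, assuming ~100 GB/s for a base M2 (that bandwidth figure is an assumption, not a measurement from this post):

```python
# Roofline ceiling for token generation:
# tok/s <= memory bandwidth / bytes read per token (~ model size).
BANDWIDTH_GB_S = 100  # assumed base M2 memory bandwidth

for name, size_gb in [("FP16", 2.1), ("Q4_K_M", 0.668)]:
    ceiling = BANDWIDTH_GB_S / size_gb
    print(f"{name}: <= {ceiling:.0f} tok/s")
```

FP16's measured ~45 tok/s sits right at its ~48 tok/s ceiling, confirming it is bandwidth-bound; Q4_K_M's ~85 falls short of its ~150 ceiling because dequantizing weights on the fly adds compute overhead.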
Serving with Ollama
Once you have a GGUF file, getting it into Ollama is two commands:
```sh
echo 'FROM ./output/gguf/model-Q4_K_M.gguf' > Modelfile
ollama create my-json-model -f Modelfile
```
Then use it:
```sh
ollama run my-json-model "Describe Go in structured format."
```
Or through the API:
```sh
curl http://localhost:11434/api/generate -d '{
  "model": "my-json-model",
  "prompt": "Describe Go in structured format.",
  "stream": false
}'
```
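Since this model was fine-tuned to emit structured JSON, the cheapest quality gate per quantization level is simply: does the output still parse? A minimal sketch of that check (the sample strings below are illustrative, not real model output):

```python
import json

def valid_json(text: str) -> bool:
    """Return True if the model's raw output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Illustrative outputs: a well-formed response and a truncated one,
# the kind of degradation you might see at aggressive quant levels.
good = '{"language": "Go", "typing": "static", "gc": true}'
bad = '{"language": "Go", "typing": "stat'

print(valid_json(good), valid_json(bad))  # True False
```

In practice you would loop over a handful of prompts against each quantized model (via `ollama run` or the API above) and count parse failures per quant level.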
The Pipeline
To summarize, going from a fine-tuned model to fast local inference:
- Merge the LoRA adapter into the base model (covered in the merging post)
- Convert safetensors to GGUF with `convert_hf_to_gguf.py`
- Quantize with `llama-quantize` (Q4_K_M is a sensible default)
- Serve with Ollama or llama.cpp
The entire process takes under a minute for a 1B model. For a 7B model, expect a few minutes. No GPU needed for the quantization step itself (unlike GPTQ or AWQ).
References
- llama.cpp: C/C++ inference engine with GGUF conversion and quantization.
- Ollama: Local LLM runner built on llama.cpp.
- GGUF format specification: File format documentation.