Quantizing Your Own Fine-Tuned Model with llama.cpp

You can find almost any popular model pre-quantized on Hugging Face in every GGUF variant imaginable. But the moment you fine-tune your own model, you are on your own. Nobody has uploaded a Q4_K_M of your custom checkpoint. If you want to run it through Ollama or llama.cpp and actually use your GPU efficiently, you need to quantize it yourself.

This post walks through the full pipeline: taking the merged safetensors model from the merging post, converting it to GGUF, quantizing it at different levels, and measuring what you lose. If you want to understand the theory behind these quantization types, see the quantization post.
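The pipeline described above can be sketched with llama.cpp's own tools. This is a minimal outline, assuming llama.cpp is cloned and built; the directory and file names (`./merged-model`, `merged-f16.gguf`, `wiki.test.raw`) are illustrative placeholders, not paths from the post.

```shell
# 1. Convert the merged safetensors checkpoint to a full-precision GGUF file
#    (./merged-model is a placeholder for your merged HF-format model directory)
python convert_hf_to_gguf.py ./merged-model \
    --outfile merged-f16.gguf --outtype f16

# 2. Quantize the f16 GGUF down to a 4-bit K-quant variant
./llama-quantize merged-f16.gguf merged-Q4_K_M.gguf Q4_K_M

# 3. Measure what you lose: compare perplexity on a held-out text file
#    (wiki.test.raw is a placeholder for whatever eval text you use)
./llama-perplexity -m merged-f16.gguf    -f wiki.test.raw
./llama-perplexity -m merged-Q4_K_M.gguf -f wiki.test.raw
```

Depending on how llama.cpp was built, the binaries may live under `build/bin/` rather than the repo root, and the quantization level in step 2 can be swapped for any other supported type (Q5_K_M, Q8_0, and so on).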

Read more

How LLM Quantization Actually Works

I got interested in quantization from two directions. At work, I spent time inside picoLLM's inference engine, digging into how its compression algorithm squeezes models down for on-device deployment. Outside of work, every time a new local model drops, I find myself staring at a list of GGUF variants on Ollama or llama.cpp, trying to pick between Q4_K_M, Q5_K_S, Q8_0, and wondering what I am actually trading off. This post is what I have learned about quantization through those experiences.

Read more