Quantizing Your Own Fine-Tuned Model with llama.cpp

You can find almost any popular model pre-quantized on Hugging Face in every GGUF variant imaginable. But the moment you fine-tune your own model, you are on your own. Nobody has uploaded a Q4_K_M of your custom checkpoint. If you want to run it through Ollama or llama.cpp and actually use your GPU efficiently, you need to quantize it yourself.

This post walks through the full pipeline: taking the merged safetensors model from the merging post, converting it to GGUF, quantizing it at different levels, and measuring what you lose. If you want to understand the theory behind these quantization types, see the quantization post.
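In outline, the conversion-and-quantization step looks roughly like this. This is a sketch, not the post's exact commands: paths are placeholders, and the script and binary names (`convert_hf_to_gguf.py`, `llama-quantize`) are the ones in recent llama.cpp checkouts; older versions named them differently.

```shell
# Convert the merged safetensors checkpoint to a full-precision GGUF.
# (convert_hf_to_gguf.py ships in the llama.cpp repo.)
python convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16

# Quantize to Q4_K_M; repeat with other types (Q5_K_M, Q8_0, ...) to compare.
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```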

Read more

How LLM Quantization Actually Works

I got interested in quantization from two directions. At work, I spent time inside picoLLM’s inference engine, digging into how its compression algorithm squeezes models down for on-device deployment. Outside of work, every time a new local model drops, I find myself staring at a list of GGUF variants on Ollama or llama.cpp trying to pick between Q4_K_M, Q5_K_S, Q8_0, and wondering what I am actually trading off. This post is what I have learned about quantization through those experiences.
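The core trade-off is easy to see in miniature. Here is a toy sketch of absmax block quantization (my own illustration, not picoLLM's or llama.cpp's actual kernels): each block of weights is stored as small integers plus one float scale, and the reconstruction error per weight is bounded by half a quantization step.

```python
def quantize_block(block, bits=4):
    """Absmax-quantize one block of float weights to signed integers.

    Real quantizers also handle all-zero blocks and pack two 4-bit
    values per byte; this sketch skips both for clarity.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    scale = max(abs(w) for w in block) / qmax
    q = [round(w / scale) for w in block]   # integers in [-qmax-1, qmax]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

# One 8-weight block (real GGUF blocks are larger, e.g. 32 weights).
block = [0.03, -0.11, 0.07, 0.25, -0.02, 0.18, -0.30, 0.09]
q, scale = quantize_block(block)
recon = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, recon))
```

Fewer bits means a coarser grid (larger `scale`), so the worst-case error per weight grows as the bit width shrinks; that is the trade-off behind the Q4/Q5/Q8 ladder.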

Read more

Reward Modeling: Scoring LLM Outputs

In the DPO post, I used DPO to skip the reward model entirely. DPO’s whole selling point is that you don’t need one. But building a reward model completes the picture of how RLHF actually works, and it lets us do something interesting: compare the explicit reward model’s scores to DPO’s implicit reward and see if they agree.
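For reference, the implicit reward being compared here is the standard one from the DPO paper (stated here from the paper, not from the post itself):

```latex
\hat{r}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

It is defined only up to a prompt-dependent constant, which cancels when ranking two responses to the same prompt, so it is exactly the right object to compare against an explicit reward model's scores.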

Read more

Merging LoRA Adapters and Serving Locally

In the SFT and DPO posts, I trained LoRA adapters using pure PyTorch. The adapters are tiny (~4MB), but at inference time you still need to load the base model, inject the LoRA wrappers, and load the adapter weights. What if you just want a single, standalone model you can run anywhere?

Merging folds the adapter back into the base weights permanently. The result is a standard model file with no adapter machinery required.
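The arithmetic behind the merge is just $W' = W + \frac{\alpha}{r} B A$. A minimal pure-Python sketch (illustrative only; a real merge applies this to every LoRA-wrapped weight matrix in the checkpoint, typically with library tensors rather than nested lists):

```python
def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into base weights: W' = W + (alpha / r) * (B @ A).

    Shapes: W is d_out x d_in, B is d_out x r, A is r x d_in.
    """
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

# Tiny example: a 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]          # d_out x r
A = [[0.1, 0.2, 0.3]]       # r x d_in
merged = merge_lora(W, A, B, alpha=2, r=1)
```

After merging, the adapter matrices are gone: `merged` is an ordinary weight matrix, which is why the result needs no LoRA machinery at inference time.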

Read more

Teaching a Small LLM to Prefer JSON Over Prose

In my previous post, I used SFT with LoRA to teach a small model to respond in structured JSON. It worked, but SFT is imitation learning: you show the model exactly what to produce, and it copies the pattern. What if instead of demonstrating the right answer, you just tell the model which answer you prefer?

That is the idea behind RLHF, and it is how most production LLMs are aligned after pretraining. The traditional approach uses PPO with a separate reward model, which is notoriously finicky. DPO (Direct Preference Optimization) sidesteps all of that, collapsing the reward model and RL loop into a single supervised loss function.
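That single supervised loss, as given in the original DPO paper (quoted from the paper, not from this post), is:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
```

Here $y_w$ is the preferred response, $y_l$ the rejected one, $\sigma$ the sigmoid, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the frozen reference $\pi_{\mathrm{ref}}$.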

Read more

Fine-Tuning a Small LLM from Scratch with LoRA

Small language models are getting remarkably capable. Alibaba’s Qwen3.5 small model series, ranging from 0.6B to 7B parameters, runs on a laptop and punches well above its weight on benchmarks. I have been building a personal GTD system powered by these small, local models. The models understood my tasks, but they weren’t reliable: they’d return prose instead of the structured JSON the pipeline expected, or hallucinate values outside the expected schema. I didn’t need a smarter model; I needed a more obedient one.

Fine-tuning was the obvious fix, and I took it as an excuse to go deep: no frameworks, no abstractions, just PyTorch, LoRA, and a laptop.

Read more