Mohammadreza Rostam's Website

Posted 2025-12-05 | Updated 2025-12-19 | Machine Learning | 7 minutes read (About 1053 words)

In the DPO post, I used DPO to skip the reward model entirely. DPO’s whole selling point is that you don’t need one. But building a reward model completes the picture of how RLHF actually works, and it lets us do something interesting: compare the explicit reward model’s scores to DPO’s implicit reward and see if they agree.
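As a rough illustration of that comparison, the sketch below checks how often an explicit reward model and DPO's implicit reward rank a preference pair the same way. It assumes the usual DPO formulation, where the implicit reward is `beta * (log pi(y|x) - log pi_ref(y|x))` up to a term that depends only on the prompt. The function names, the dictionary keys, and all the numbers are hypothetical and not taken from the post.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """DPO implicit reward, up to a prompt-only constant.

    policy_logp / ref_logp are summed token log-probabilities of one
    response under the fine-tuned policy and the frozen reference model.
    """
    return beta * (policy_logp - ref_logp)


def agreement_rate(pairs, beta=0.1):
    """Fraction of pairs where both rewards prefer the same response."""
    agree = 0
    for p in pairs:
        # Margin according to the explicit reward model's scalar scores.
        explicit_margin = p["rm_chosen"] - p["rm_rejected"]
        # Margin according to DPO's implicit reward.
        implicit_margin = (
            implicit_reward(p["pi_chosen"], p["ref_chosen"], beta)
            - implicit_reward(p["pi_rejected"], p["ref_rejected"], beta)
        )
        # Same sign means both rewards rank the pair the same way.
        if explicit_margin * implicit_margin > 0:
            agree += 1
    return agree / len(pairs)


# Made-up numbers, purely for illustration.
example_pairs = [
    {"rm_chosen": 2.0, "rm_rejected": -1.0,
     "pi_chosen": -12.0, "ref_chosen": -15.0,
     "pi_rejected": -20.0, "ref_rejected": -18.0},
    {"rm_chosen": 0.5, "rm_rejected": 1.5,
     "pi_chosen": -10.0, "ref_chosen": -11.0,
     "pi_rejected": -14.0, "ref_rejected": -13.0},
]
rate = agreement_rate(example_pairs)
```

Only the sign of each margin matters here, so the prompt-only constant in the implicit reward cancels out and the two rewards do not need to be on the same scale.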

Posted 2025-11-13 | Updated 2025-11-28 | Machine Learning | 11 minutes read (About 1649 words)

Teaching a Small LLM to Prefer JSON Over Prose

In my previous post, I used SFT with LoRA to teach a small model to respond in structured JSON. It worked, but SFT is imitation learning: you show the model exactly what to produce, and it copies the pattern. What if instead of demonstrating the right answer, you just tell the model which answer you prefer?

That is the idea behind RLHF, and it is how most production LLMs are aligned after pretraining. The traditional approach uses PPO with a separate reward model, which is notoriously finicky. DPO (Direct Preference Optimization) sidesteps all of that, collapsing the reward model and RL loop into a single supervised loss function.
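To make "a single supervised loss function" concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function name, argument names, and the default `beta` are illustrative choices, not the post's actual code.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each argument is the summed log-probability of a full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-15.0, -15.0, -15.0, -15.0)
# When the policy shifts probability toward the chosen response, the loss drops.
improved = dpo_loss(-10.0, -20.0, -15.0, -15.0)
```

There is no sampling loop and no separate reward network in this computation: the loss is an ordinary function of four log-probabilities, which is why it can be minimized with standard supervised training.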

Posted 2025-11-01 | Updated 2025-11-14 | Machine Learning | 5 minutes read (About 702 words)

Fine-Tuning a Small LLM from Scratch with LoRA

Small language models are getting remarkably capable. Alibaba’s Qwen3.5 small model series, ranging from 0.6B to 7B parameters, runs on a laptop and punches well above its weight on benchmarks. I have been building a personal GTD system powered by these small, local models. The models understood my tasks, but they weren’t reliable: they’d return prose instead of the structured JSON the pipeline expected, or hallucinate values outside the expected schema. I didn’t need a smarter model; I needed a more obedient one.

Fine-tuning was the obvious fix, and I took it as an excuse to go deep: no frameworks, no abstractions, just PyTorch, LoRA, and a laptop.