Fine-tuning takes a pretrained model and continues training it on a smaller, more focused dataset. The model already understands general language; you're guiding it toward a specialty.
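As a toy sketch of the idea (not a real language model): start from a "pretrained" one-parameter model and keep training it, with a small learning rate, on a tiny specialty dataset. All numbers here are made up for illustration.

```python
# Toy fine-tuning: a "pretrained" model y = w * x continues training
# on a small specialized dataset. Purely illustrative numbers.

def mse_grad(w, data):
    """Gradient of mean squared error for predictions y_pred = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 2.0  # pretend this weight came from large-scale pretraining
specialty_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # true relation: y = 3x

for _ in range(200):
    w -= 0.01 * mse_grad(w, specialty_data)  # small learning rate, few steps

print(round(w, 2))  # the weight drifts toward the specialty's pattern
```

The mechanics are the same as pretraining; what changes is the data and the scale of the update.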
Why do modern chatbots feel helpful, polite, and safe? Without any guidance, a model just continues patterns — it'd say anything that sounds probable, even if harmful.
RLHF (Reinforcement Learning from Human Feedback) fixes this by injecting human judgment into training. Humans rank AI responses, and the model learns to prefer what people actually like.
Over thousands of these comparisons, the model learns a sense of preference — what helpful, clear, and safe answers look like. That's why modern AI feels very different from raw language models.
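The core of that preference learning can be sketched with a tiny reward model: given pairs of responses where a human preferred one, nudge the model's parameters so the preferred response scores higher (a Bradley–Terry-style objective). The feature vectors and data below are invented for illustration.

```python
import math

# Toy reward model for preference learning, the idea behind RLHF's
# reward step. Features and pairs are made up for illustration.

def reward(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Each pair: (features of the preferred response, features of the rejected one)
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.2, 0.7]),
]

theta = [0.0, 0.0]
for _ in range(500):
    for good, bad in pairs:
        # Gradient of -log sigmoid(r_good - r_bad): push the preferred
        # response's score above the rejected one's.
        p = sigmoid(reward(theta, good) - reward(theta, bad))
        for i in range(len(theta)):
            theta[i] += 0.1 * (1 - p) * (good[i] - bad[i])

print(all(reward(theta, good) > reward(theta, bad) for good, bad in pairs))
```

In real RLHF this learned reward then guides a reinforcement-learning step on the language model itself; the sketch covers only the "learn what humans prefer" part.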
Fine-tuning a huge model means updating billions of parameters, which is expensive and hard to manage. LoRA (Low-Rank Adaptation) is a clever shortcut.
Instead of modifying the entire model, LoRA keeps the original frozen and adds tiny trainable components on top — often less than 1% of the total parameters.
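A minimal sketch of the mechanism: the frozen weight matrix W gets a low-rank update from two small trainable matrices, so the effective weight is W + (alpha / r) · B A. The shapes and numbers below are illustrative, not from any real model.

```python
# Minimal LoRA sketch: W stays frozen; only the small matrices A (r x d)
# and B (d x r) are trained. Effective weight: W + (alpha / r) * (B @ A).
# All shapes and numbers here are illustrative.

d, r, alpha = 1024, 4, 8

frozen_params = d * d       # the original layer, never updated
lora_params = 2 * d * r     # A and B together: the only trainable weights
print(f"trainable fraction: {lora_params / frozen_params:.2%}")  # under 1%

# Tiny numeric demo of the same idea with d=2, r=1, using plain lists.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity here)
A = [[0.5, 0.5]]               # trainable, shape (r, d) = (1, 2)
B = [[1.0], [-1.0]]            # trainable, shape (d, r) = (2, 1)
scale = 8 / 4                  # alpha / r for this demo

W_eff = [[W[i][j] + scale * B[i][0] * A[0][j] for j in range(2)]
         for i in range(2)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

x = [1.0, 2.0]
print(matvec(W_eff, x))  # → [4.0, -1.0]
```

Because only A and B receive gradients, the optimizer state and checkpoints shrink along with the trainable parameter count, which is where most of the savings come from.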
As models get bigger, running them gets harder. Quantization solves this by storing model weights with less precision — using fewer bits per number.
A full-precision model stores each weight in 32 bits. Quantizing to 4 bits makes it 8× smaller. Quality drops a tiny bit, but the model becomes much more practical to run.
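One simple scheme, symmetric quantization, maps each float to one of 16 integer levels (4 bits) plus a single shared scale factor. The weights below are made-up numbers; real quantizers work per-block and are more sophisticated.

```python
# Sketch of symmetric 4-bit quantization: store each weight as an integer
# in [-8, 7] plus one shared float scale. Weights are made-up numbers.

weights = [0.31, -0.12, 0.08, -0.27, 0.19, -0.05]

scale = max(abs(w) for w in weights) / 7           # 7 = largest positive level
quantized = [max(-8, min(7, round(w / scale))) for w in weights]
restored = [q * scale for q in quantized]          # dequantize for inference

print(quantized)
max_err = max(abs(w - v) for w, v in zip(weights, restored))
print(f"worst-case rounding error: {max_err:.3f}")
```

Each weight now costs 4 bits instead of 32, at the price of a small, bounded rounding error; that error is the "quality drops a tiny bit" in practice.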
Example: a 70B-parameter model needs about 280 GB of memory at 32-bit precision, but only about 35 GB at 4-bit — within reach of high-end desktop hardware, such as a single 48 GB workstation GPU or a pair of 24 GB consumer cards.
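The arithmetic behind those figures is just parameter count times bits per weight:

```python
# Back-of-the-envelope memory for a 70B-parameter model at various precisions.
params = 70e9

for bits in (32, 16, 8, 4):
    gb = params * bits / 8 / 1e9   # bits -> bytes -> gigabytes (decimal)
    print(f"{bits:>2}-bit: {gb:.0f} GB")
```

Actual memory use runs somewhat higher than this floor, since activations, the KV cache, and quantization metadata all add overhead.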