What Is Post-Training Alignment?
Learn how Direct Preference Optimization (DPO) aligns a fine-tuned VLM with expert preferences. Understand the preference signal, how DPO compares to RLHF, and when to add a post-training stage.
Post-training alignment trains a fine-tuned vision-language model (VLM) to prefer grounded, accurate outputs over hallucinated ones. After supervised fine-tuning (SFT) teaches the model what to do, alignment teaches it how to behave: which response style experts prefer, where to be cautious, and when to say "I don't know." Datature Vi uses Direct Preference Optimization (DPO) for this stage, which maps human feedback directly to gradient updates without a separate reward model.
How does DPO work?
DPO starts with a fine-tuned model and a set of preference pairs. Each pair contains two model outputs for the same prompt: one that an expert marked as better (chosen) and one marked as worse (rejected). The training objective increases the likelihood of chosen responses relative to rejected ones.
The preference signal
Domain experts review side-by-side model outputs for the same input image and prompt. They mark which response is more grounded, more accurate, or less hallucinated. This produces a (chosen, rejected) pair that becomes the training signal.
A typical alignment iteration collects 1,000 to 10,000 preference pairs. Reviewers do not need to write new text. They only compare and choose between two existing model outputs.
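A preference pair is just the prompt plus the two competing outputs and the expert's verdict. The sketch below shows one plausible shape for such a record; the field names and example strings are illustrative, not the actual Datature Vi schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    # Field names are illustrative, not the Datature Vi data schema.
    image_id: str
    prompt: str
    chosen: str    # response the expert marked as better
    rejected: str  # response the expert marked as worse

pair = PreferencePair(
    image_id="scan-0042",
    prompt="Describe any abnormalities in this scan.",
    chosen="A small nodule is visible in the upper left lobe.",
    rejected="Multiple large masses are present throughout both lungs.",
)
```

Note that both responses already exist before review: the reviewer's only job is to pick `chosen` over `rejected`, which is why no new annotation text is needed.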
How the model learns from preferences
DPO uses the language model itself as an implicit reward model. Instead of training a separate network to score outputs (as RLHF does), DPO derives the reward signal from the log-probability difference between chosen and rejected responses. A single hyperparameter called beta controls the trade-off between alignment strength and divergence from the reference model.
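The implicit-reward idea above can be written out directly. The sketch below computes the standard DPO loss for a single preference pair from summed response log-probabilities; the function name and the example log-probability values are made up for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed response log-probabilities.

    pi_*  : log-prob of the response under the policy being trained
    ref_* : log-prob under the frozen reference (the SFT checkpoint)
    beta  : trades alignment strength against divergence from the reference
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each response.
    r_chosen = beta * (pi_chosen - ref_chosen)
    r_rejected = beta * (pi_rejected - ref_rejected)
    margin = r_chosen - r_rejected
    # -log sigmoid(margin): small when the chosen response outscores
    # the rejected one, large when the ranking is inverted.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# A positive margin means the policy already ranks chosen above rejected.
loss, margin = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

A larger `beta` amplifies the margin, so the loss pushes harder on pairs the model gets wrong but also lets the policy drift further from the reference.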
Three key metrics track alignment progress: preference accuracy (the fraction of held-out pairs where the model scores the chosen response above the rejected one), the reward margin between chosen and rejected responses, and KL divergence from the reference model.
How DPO differs from RLHF
Both DPO and RLHF use human preference data to align a model. They differ in complexity, cost, and stability.
DPO removes the reward model and the reinforcement learning loop. This cuts training time, reduces the number of moving parts, and eliminates a common source of instability: reward model drift.
When should you use post-training?
SFT and DPO solve different problems:
- SFT teaches the model a new task: how to ground phrases in images, answer questions about medical scans, or extract structured data from documents. It needs annotated examples of correct input-output pairs.
- DPO polishes behavior after the model already knows the task. It reduces hallucinations, improves grounding accuracy, and aligns response style with expert expectations. It needs preference comparisons, not new annotations.
Use post-training when your SFT model produces reasonable outputs but you want to push accuracy higher, reduce specific failure modes, or align the model's judgment with domain experts.
DPO requires a fine-tuned checkpoint as its starting point. You cannot skip SFT and go straight to DPO. Train with SFT until the model produces coherent task-relevant outputs, then use DPO to refine behavior.
Iterative alignment rounds
Alignment improves through multiple rounds. Each round collects new preference pairs, runs DPO, and evaluates the result. Early rounds fix obvious failure modes. Later rounds target harder edge cases.
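The collect, train, evaluate loop above can be sketched as a single function. Everything here is a placeholder for illustration: the callables stand in for preference collection, DPO training, and evaluation, and the 85% convergence threshold comes from the convergence guidance later in this section.

```python
def alignment_round(checkpoint, collect_pairs, run_dpo, evaluate,
                    target_acc=0.85):
    """One round of the collect -> train -> evaluate loop.

    The three callables are hypothetical stand-ins for the real
    collection, training, and evaluation steps.
    """
    pairs = collect_pairs(checkpoint)          # gather new preference pairs
    new_checkpoint = run_dpo(checkpoint, pairs)  # run DPO on this round's pairs
    metrics = evaluate(new_checkpoint)           # score the updated model
    done = metrics["preference_accuracy"] >= target_acc
    return new_checkpoint, metrics, done

# Stub callables showing the control flow for a first round.
ckpt, metrics, done = alignment_round(
    "sft-v1",
    collect_pairs=lambda c: ["pair"] * 2000,
    run_dpo=lambda c, pairs: c + "+dpo",
    evaluate=lambda c: {"preference_accuracy": 0.68},
)
```

With a first-round accuracy of 0.68, `done` is false and the loop continues into the next round with a fresh batch of pairs.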
Round 1: Broad coverage
Collect preference pairs across all prompt categories. Focus on the most common failure modes: hallucinated objects, incorrect spatial descriptions, wrong counts. Expect preference accuracy around 68% with roughly 2,000-2,400 pairs.
Round 2: Edge case mining
Surface samples where the model is least confident (chosen vs rejected probability closest to 50/50). Focus on spatial grounding and multi-object scenes. Accuracy typically reaches 76% with 1,500-1,800 targeted pairs.
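Surfacing the least-confident samples amounts to ranking pairs by how close the implied preference probability sits to 50/50. The sketch below assumes each candidate is a `(prompt_id, margin)` tuple, where `margin` is the implicit reward difference (chosen minus rejected) from the previous round; that record shape is an assumption for illustration.

```python
import math

def mine_uncertain_pairs(records, k=3):
    """Return the k pairs whose implied preference probability is
    closest to 0.5, i.e. where the model is least decided.

    records: list of (prompt_id, margin) tuples (hypothetical shape).
    """
    def closeness_to_coin_flip(rec):
        prob_chosen = 1.0 / (1.0 + math.exp(-rec[1]))  # sigmoid of margin
        return abs(prob_chosen - 0.5)
    return sorted(records, key=closeness_to_coin_flip)[:k]

candidates = [
    ("img-001", 2.3),   # confidently correct ranking
    ("img-002", 0.1),   # near coin-flip
    ("img-003", -0.05), # near coin-flip, slightly inverted
    ("img-004", 1.1),
]
top = mine_uncertain_pairs(candidates, k=2)
# img-003 and img-002 have margins nearest zero, so they are reviewed first
```

Sending these near-50/50 pairs to reviewers concentrates Round 2 labeling effort where a single expert judgment moves the model most.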
Round 3: Failure mode targeting
Cluster remaining errors by type: negation handling, counting errors, fine-grained attribute distinctions. Collect pairs that address the largest remaining cluster. Accuracy reaches roughly 82% with 1,000-1,200 pairs.
Round 4 and beyond: Convergence
Gains diminish as the model addresses most failure modes. When preference accuracy plateaus, shift to domain-specific edge cases or new prompt categories. Accuracy above 85% signals convergence for most tasks.
Each round saves a new checkpoint with full metadata: preference dataset version, training config, evaluation metrics, and parent checkpoint lineage. You can roll back to any previous iteration.
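A checkpoint's metadata might look like the record below. The keys mirror the fields listed above (dataset version, training config, evaluation metrics, parent lineage), but the exact names and values are illustrative, not the Datature Vi metadata format.

```python
# Illustrative metadata record; keys are not the exact Datature Vi schema.
checkpoint_metadata = {
    "checkpoint_id": "dpo-round-2",
    "parent_checkpoint": "dpo-round-1",        # lineage enables rollback
    "preference_dataset_version": "prefs-v2",
    "training_config": {"beta": 0.1, "epochs": 1},
    "evaluation": {"preference_accuracy": 0.76},
}
```

Because each record names its parent, you can walk the lineage backwards and restore any earlier round if a later one regresses.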
Related resources
Post-Training Overview
Set up preference collection and run DPO alignment in Datature Vi.
How Does VLM Training Work?
Epochs, batch size, learning rate, and loss curves for SFT.
How Do I Evaluate My Model?
Understand IoU, F1, BLEU, BERTScore, and what good scores look like.
LoRA and Quantization
Reduce training cost with low-rank adaptation and 4-bit precision.
Improve Your Model
Diagnose weak results and retrain with better data.
Resource Usage
GPU memory specs and compute credit costs.