What Is Post-Training Alignment?

Learn how Direct Preference Optimization (DPO) aligns a fine-tuned VLM with expert preferences. Understand the preference signal, how DPO compares to RLHF, and when to add a post-training stage.


Post-training alignment trains a fine-tuned vision-language model (VLM) to prefer grounded, accurate outputs over hallucinated ones. After supervised fine-tuning (SFT) teaches the model what to do, alignment teaches it how to behave: which response style experts prefer, where to be cautious, and when to say "I don't know." Datature Vi uses Direct Preference Optimization (DPO) for this stage; DPO maps human feedback directly to gradient updates without training a separate reward model.


How does DPO work?

DPO starts with a fine-tuned model and a set of preference pairs. Each pair contains two model outputs for the same prompt: one that an expert marked as better (chosen) and one marked as worse (rejected). The training objective increases the likelihood of chosen responses relative to rejected ones.

The preference signal

Domain experts review side-by-side model outputs for the same input image and prompt. They mark which response is more grounded, more accurate, or less hallucinated. This produces a (chosen, rejected) pair that becomes the training signal.

A typical alignment iteration collects 1,000 to 10,000 preference pairs. Reviewers do not need to write new text. They only compare and choose between two existing model outputs.
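
For illustration only, a single preference pair can be represented as a small record pairing the prompt and image reference with the chosen and rejected responses. The field names and example values below are hypothetical, not a fixed Datature Vi schema:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One expert judgment: two model outputs for the same image and prompt."""
    image_path: str   # input image both responses describe
    prompt: str       # instruction shown to the model
    chosen: str       # response the expert marked as better
    rejected: str     # response the expert marked as worse

pair = PreferencePair(
    image_path="scans/case_0042.png",
    prompt="Describe any abnormalities visible in this scan.",
    chosen="A 2 cm opacity is visible in the left lower lobe.",
    rejected="The scan shows multiple fractures in both femurs.",
)
```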

How the model learns from preferences

DPO uses the language model itself as an implicit reward model. Instead of training a separate network to score outputs (as RLHF does), DPO derives the reward signal from the log-probability difference between chosen and rejected responses. A single hyperparameter called beta controls the trade-off between alignment strength and divergence from the reference model.
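
The update itself fits in a few lines. The sketch below shows the standard DPO objective in PyTorch-style code, assuming you already have summed per-token log-probabilities for each response under the policy being trained and under the frozen reference checkpoint; it is a minimal illustration, not Datature Vi's internal implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over per-sequence log-probabilities.

    The policy acts as its own implicit reward model: the reward for a
    response is beta * (policy log-prob - reference log-prob).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Maximize the margin between chosen and rejected implicit rewards.
    loss = -F.logsigmoid(margin)
    return loss.mean(), margin.mean()  # mean loss, mean reward gap
```

Lower beta lets the policy move further from the reference model per update; higher beta keeps it closer, which is why beta also bounds the KL drift discussed below.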

Three key metrics track alignment progress:

| Metric | What it measures | Target |
| --- | --- | --- |
| Reward gap | Average log-probability difference between chosen and rejected responses | Wider gap = stronger preference signal |
| Preference accuracy | Fraction of held-out pairs where the model assigns higher probability to the expert-chosen response | Above 65% after the first round |
| KL divergence | How far the aligned model has drifted from the reference checkpoint | Below 0.5 (controlled by beta) |
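
As a rough sketch, all three can be estimated from held-out pairs using the same log-probability tensors as in the loss above; these are simplified estimates for illustration, not the platform's exact metric definitions:

```python
import torch

def alignment_metrics(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected, beta=0.1):
    """Rough estimates of reward gap, preference accuracy, and KL drift."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward

    reward_gap = margin.mean().item()
    # Fraction of pairs where the implicit reward favors the expert-chosen response.
    preference_accuracy = (margin > 0).float().mean().item()
    # Crude KL proxy: average log-prob shift of the policy away from the reference.
    kl_estimate = torch.cat([policy_chosen - ref_chosen,
                             policy_rejected - ref_rejected]).mean().item()
    return reward_gap, preference_accuracy, kl_estimate
```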

How DPO differs from RLHF

Both DPO and RLHF use human preference data to align a model. They differ in complexity, cost, and stability.

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Pipeline stages | Four: collect preferences, train reward model, run PPO, evaluate | Three: collect preferences, run DPO, evaluate |
| Models to train | Three (policy + reward + value) | One (policy only) |
| Typical GPU hours | 48-120 hours | 8-24 hours |
| Hyperparameters to tune | 15+ | 3 (beta, learning rate, epochs) |
| Stability | Sensitive to reward hacking; reward model can drift | Inherently stable (KL-constrained) |
| Separate reward model | Required; needs retraining when policy changes | Not needed; the policy is its own reward model |

DPO removes the reward model and the reinforcement learning loop. This cuts training time, reduces the number of moving parts, and eliminates a common source of instability: reward model drift.


When should you use post-training?

SFT and DPO solve different problems:

  • SFT teaches the model a new task: how to ground phrases in images, answer questions about medical scans, or extract structured data from documents. It needs annotated examples of correct input-output pairs.
  • DPO polishes behavior after the model already knows the task. It reduces hallucinations, improves grounding accuracy, and aligns response style with expert expectations. It needs preference comparisons, not new annotations.

Use post-training when your SFT model produces reasonable outputs but you want to push accuracy higher, reduce specific failure modes, or align the model's judgment with domain experts.

SFT first, DPO second

DPO requires a fine-tuned checkpoint as its starting point. You cannot skip SFT and go straight to DPO. Train with SFT until the model produces coherent task-relevant outputs, then use DPO to refine behavior.


Iterative alignment rounds

Alignment improves through multiple rounds. Each round collects new preference pairs, runs DPO, and evaluates the result. Early rounds fix obvious failure modes. Later rounds target harder edge cases.
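
The loop itself is simple. In the sketch below, collect_pairs, train_dpo, and eval_accuracy are hypothetical placeholders for your own collection, training, and evaluation steps rather than functions provided by Datature Vi:

```python
def run_alignment_rounds(checkpoint, collect_pairs, train_dpo, eval_accuracy,
                         max_rounds=4, plateau_delta=0.01):
    """Repeat collect -> DPO -> evaluate until preference accuracy plateaus."""
    history = []
    for round_idx in range(1, max_rounds + 1):
        pairs = collect_pairs(checkpoint, round_idx)   # expert (chosen, rejected) pairs
        checkpoint = train_dpo(checkpoint, pairs)      # one DPO run from the latest checkpoint
        accuracy = eval_accuracy(checkpoint)           # preference accuracy on held-out pairs
        history.append(accuracy)
        if len(history) >= 2 and history[-1] - history[-2] < plateau_delta:
            break  # gains have flattened; stop or shift to new edge cases
    return checkpoint, history
```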

Round 1: Broad coverage

Collect preference pairs across all prompt categories. Focus on the most common failure modes: hallucinated objects, incorrect spatial descriptions, wrong counts. Expect preference accuracy around 68% with roughly 2,000-2,400 pairs.

Round 2: Edge case mining

Surface samples where the model is least confident (chosen vs rejected probability closest to 50/50). Focus on spatial grounding and multi-object scenes. Accuracy typically reaches 76% with 1,500-1,800 targeted pairs.
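
"Least confident" can be made concrete by ranking held-out pairs by how close the model's implied preference probability is to 0.5. The ranking heuristic below follows the standard DPO/Bradley-Terry formulation and is an illustrative assumption, not a documented platform feature:

```python
import torch

def least_confident_indices(policy_chosen, policy_rejected,
                            ref_chosen, ref_rejected, beta=0.1, k=500):
    """Return indices of the k pairs whose preference probability is closest to 0.5."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    p_chosen = torch.sigmoid(margin)       # model's implied probability that "chosen" wins
    uncertainty = (p_chosen - 0.5).abs()   # smaller = closer to a coin flip
    return torch.argsort(uncertainty)[:k]
```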

Round 3: Failure mode targeting

Cluster remaining errors by type: negation handling, counting errors, fine-grained attribute distinctions. Collect pairs that address the largest remaining cluster. Accuracy reaches roughly 82% with 1,000-1,200 pairs.

Round 4 and beyond: Convergence

Gains diminish as the model addresses most failure modes. When preference accuracy plateaus, shift to domain-specific edge cases or new prompt categories. Accuracy above 85% signals convergence for most tasks.

Each round saves a new checkpoint with full metadata: preference dataset version, training config, evaluation metrics, and parent checkpoint lineage. You can roll back to any previous iteration.


Frequently asked questions

How many preference pairs do I need?

Start with 1,000-2,000 pairs for the first round. Quality matters more than volume: 1,000 well-judged pairs outperform 5,000 noisy ones. Each subsequent round typically needs fewer pairs (1,200-1,800) because you are targeting specific failure modes rather than covering everything.

Do reviewers need to be domain experts?

DPO works best with reviewers who understand the domain. A radiologist can distinguish a clinically accurate report from a plausible-sounding hallucination. A generalist reviewer might miss subtle errors. If you lack domain experts, start with SFT and add more annotated training data instead.

What does the beta hyperparameter control?

Beta controls alignment strength. Low values (0.05-0.1) allow the model to change more aggressively toward preferred outputs. High values (0.3-0.5) keep the model closer to its pre-alignment behavior. Start with 0.1 and increase it if the model diverges too far from its original capabilities (KL divergence above 0.5).

Can I skip SFT and go straight to DPO?

No. DPO builds on top of SFT. You need a fine-tuned model that already produces reasonable outputs for your task. DPO then refines those outputs based on expert preferences. Think of SFT as teaching and DPO as coaching.

How long does a DPO round take?

A single A100 GPU processes roughly 2,400 preference pairs in under 4 hours. Smaller pair sets finish faster. Multi-GPU setups reduce wall-clock time further. Total GPU hours for one DPO round are typically 8-24 hours depending on pair count and hardware.

