What Is Post-Training Alignment?

Learn how Direct Preference Optimization (DPO) aligns a fine-tuned VLM with expert preferences. Understand the preference signal, how DPO compares to RLHF, and when to add a post-training stage.


Post-training alignment trains a fine-tuned vision-language model (VLM) to prefer grounded, accurate outputs over hallucinated ones. After supervised fine-tuning (SFT) teaches the model what to do, alignment teaches it how to behave: which response style experts prefer, where to be cautious, and when to say "I don't know." Datature Vi uses Direct Preference Optimization (DPO) for this stage; DPO maps human feedback directly to gradient updates without training a separate reward model.


How does DPO work?

DPO starts with a fine-tuned model and a set of preference pairs. Each pair contains two model outputs for the same prompt: one that an expert marked as better (chosen) and one marked as worse (rejected). The training objective increases the likelihood of chosen responses relative to rejected ones.

The preference signal

Domain experts review side-by-side model outputs for the same input image and prompt. They mark which response is more grounded, more accurate, or less hallucinated. This produces a (chosen, rejected) pair that becomes the training signal.

A typical alignment iteration collects 1,000 to 10,000 preference pairs. Reviewers do not need to write new text. They only compare and choose between two existing model outputs.
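
For illustration only, a single preference pair can be represented as a small record pairing the prompt and image reference with the chosen and rejected responses. The field names and example values below are hypothetical, not a fixed Datature Vi schema:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One expert judgment: two model outputs for the same image and prompt."""
    image_path: str   # input image both responses describe
    prompt: str       # instruction shown to the model
    chosen: str       # response the expert marked as better
    rejected: str     # response the expert marked as worse

pair = PreferencePair(
    image_path="scans/case_0042.png",
    prompt="Describe any abnormalities visible in this scan.",
    chosen="A 2 cm opacity is visible in the left lower lobe.",
    rejected="The scan shows multiple fractures in both femurs.",
)
```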

How the model learns from preferences

DPO uses the language model itself as an implicit reward model. Instead of training a separate network to score outputs (as RLHF does), DPO derives the reward signal from the log-probability difference between chosen and rejected responses. A single hyperparameter called beta controls the trade-off between alignment strength and divergence from the reference model.
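
The update itself fits in a few lines. The sketch below shows the standard DPO objective in PyTorch-style code, assuming you already have summed per-token log-probabilities for each response under the policy being trained and under the frozen reference checkpoint; it is a minimal illustration, not Datature Vi's internal implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over per-sequence log-probabilities.

    The policy acts as its own implicit reward model: the reward for a
    response is beta * (policy log-prob - reference log-prob).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Maximize the margin between chosen and rejected implicit rewards.
    loss = -F.logsigmoid(margin)
    return loss.mean(), margin.mean()  # mean loss, mean reward gap
```

Lower beta lets the policy move further from the reference model per update; higher beta keeps it closer, which is why beta also bounds the KL drift discussed below.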

Three key metrics track alignment progress:

| Metric | What it measures | Target |
| --- | --- | --- |
| Reward gap | Average log-probability difference between chosen and rejected responses | Wider gap = stronger preference signal |
| Preference accuracy | Fraction of held-out pairs where the model assigns higher probability to the expert-chosen response | Above 65% after the first round |
| KL divergence | How far the aligned model has drifted from the reference checkpoint | Below 0.5 (controlled by beta) |
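
As a rough sketch, all three can be estimated from held-out pairs using the same log-probability tensors as in the loss above; these are simplified estimates for illustration, not the platform's exact metric definitions:

```python
import torch

def alignment_metrics(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected, beta=0.1):
    """Rough estimates of reward gap, preference accuracy, and KL drift."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward

    reward_gap = margin.mean().item()
    # Fraction of pairs where the implicit reward favors the expert-chosen response.
    preference_accuracy = (margin > 0).float().mean().item()
    # Crude KL proxy: average log-prob shift of the policy away from the reference.
    kl_estimate = torch.cat([policy_chosen - ref_chosen,
                             policy_rejected - ref_rejected]).mean().item()
    return reward_gap, preference_accuracy, kl_estimate
```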

How DPO differs from RLHF

Both DPO and RLHF use human preference data to align a model. They differ in complexity, cost, and stability.

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Pipeline stages | Four: collect preferences, train reward model, run PPO, evaluate | Three: collect preferences, run DPO, evaluate |
| Models to train | Three (policy + reward + value) | One (policy only) |
| Typical GPU hours | 48-120 hours | 8-24 hours |
| Hyperparameters to tune | 15+ | 3 (beta, learning rate, epochs) |
| Stability | Sensitive to reward hacking; reward model can drift | Inherently stable (KL-constrained) |
| Separate reward model | Required; needs retraining when policy changes | Not needed; the policy is its own reward model |

DPO removes the reward model and the reinforcement learning loop. This cuts training time, reduces the number of moving parts, and eliminates a common source of instability: reward model drift.


When should you use post-training?

SFT and DPO solve different problems:

  • SFT teaches the model a new task: how to ground phrases in images, answer questions about medical scans, or extract structured data from documents. It needs annotated examples of correct input-output pairs.
  • DPO polishes behavior after the model already knows the task. It reduces hallucinations, improves grounding accuracy, and aligns response style with expert expectations. It needs preference comparisons, not new annotations.

Use post-training when your SFT model produces reasonable outputs but you want to push accuracy higher, reduce specific failure modes, or align the model's judgment with domain experts.

SFT first, DPO second

DPO requires a fine-tuned checkpoint as its starting point. You cannot skip SFT and go straight to DPO. Train with SFT until the model produces coherent task-relevant outputs, then use DPO to refine behavior.


Iterative alignment rounds

Alignment improves through multiple rounds. Each round collects new preference pairs, runs DPO, and evaluates the result. Early rounds fix obvious failure modes. Later rounds target harder edge cases.
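
The loop itself is simple. In the sketch below, collect_pairs, train_dpo, and eval_accuracy are hypothetical placeholders for your own collection, training, and evaluation steps rather than functions provided by Datature Vi:

```python
def run_alignment_rounds(checkpoint, collect_pairs, train_dpo, eval_accuracy,
                         max_rounds=4, plateau_delta=0.01):
    """Repeat collect -> DPO -> evaluate until preference accuracy plateaus."""
    history = []
    for round_idx in range(1, max_rounds + 1):
        pairs = collect_pairs(checkpoint, round_idx)   # expert (chosen, rejected) pairs
        checkpoint = train_dpo(checkpoint, pairs)      # one DPO run from the latest checkpoint
        accuracy = eval_accuracy(checkpoint)           # preference accuracy on held-out pairs
        history.append(accuracy)
        if len(history) >= 2 and history[-1] - history[-2] < plateau_delta:
            break  # gains have flattened; stop or shift to new edge cases
    return checkpoint, history
```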

Round 1: Broad coverage

Collect preference pairs across all prompt categories. Focus on the most common failure modes: hallucinated objects, incorrect spatial descriptions, wrong counts. Expect preference accuracy around 68% with roughly 2,000-2,400 pairs.

Round 2: Edge case mining

Surface samples where the model is least confident (chosen vs rejected probability closest to 50/50). Focus on spatial grounding and multi-object scenes. Accuracy typically reaches 76% with 1,500-1,800 targeted pairs.
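
"Least confident" can be made concrete by ranking held-out pairs by how close the model's implied preference probability is to 0.5. The ranking heuristic below follows the standard DPO/Bradley-Terry formulation and is an illustrative assumption, not a documented platform feature:

```python
import torch

def least_confident_indices(policy_chosen, policy_rejected,
                            ref_chosen, ref_rejected, beta=0.1, k=500):
    """Return indices of the k pairs whose preference probability is closest to 0.5."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    p_chosen = torch.sigmoid(margin)       # model's implied probability that "chosen" wins
    uncertainty = (p_chosen - 0.5).abs()   # smaller = closer to a coin flip
    return torch.argsort(uncertainty)[:k]
```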

Round 3: Failure mode targeting

Cluster remaining errors by type: negation handling, counting errors, fine-grained attribute distinctions. Collect pairs that address the largest remaining cluster. Accuracy reaches roughly 82% with 1,000-1,200 pairs.

Round 4 and beyond: Convergence

Gains diminish as the model addresses most failure modes. When preference accuracy plateaus, shift to domain-specific edge cases or new prompt categories. Accuracy above 85% signals convergence for most tasks.

Each round saves a new checkpoint with full metadata: preference dataset version, training config, evaluation metrics, and parent checkpoint lineage. You can roll back to any previous iteration.


Frequently asked questions

How many preference pairs do I need?

Start with 1,000-2,000 pairs for the first round. Quality matters more than volume: 1,000 well-judged pairs outperform 5,000 noisy ones. Each subsequent round typically needs fewer pairs (1,200-1,800) because you are targeting specific failure modes rather than covering everything.

Do reviewers need to be domain experts?

DPO works best with reviewers who understand the domain. A radiologist can distinguish a clinically accurate report from a plausible-sounding hallucination. A generalist reviewer might miss subtle errors. If you lack domain experts, start with SFT and add more annotated training data instead.

What does the beta hyperparameter control?

Beta controls alignment strength. Low values (0.05-0.1) allow the model to change more aggressively toward preferred outputs. High values (0.3-0.5) keep the model closer to its pre-alignment behavior. Start with 0.1 and increase it if the model diverges too far from its original capabilities (KL divergence above 0.5).

Can I skip SFT and go straight to DPO?

No. DPO builds on top of SFT. You need a fine-tuned model that already produces reasonable outputs for your task. DPO then refines those outputs based on expert preferences. Think of SFT as teaching and DPO as coaching.

How long does a DPO round take?

A single A100 GPU processes roughly 2,400 preference pairs in under 4 hours. Smaller pair sets finish faster. Multi-GPU setups reduce wall-clock time further. Total GPU hours for one DPO round are typically 8-24 hours depending on pair count and hardware.

