Full SFT Training Guide

Configure full supervised fine-tuning (SFT) for maximum VLM accuracy in Datature Vi. Understand hardware requirements, multi-GPU setup, precision options, and when full SFT is worth the cost.

Before you start

Full supervised fine-tuning (SFT) updates every parameter in the model. This gives the training process maximum freedom to adapt the model to your domain, which can matter when your images look very different from the data the base model was trained on. The trade-off is cost: full SFT needs 5-10x more GPU memory and runs 2-3x slower than QLoRA.

This page explains when the extra cost is justified, what hardware you need, and how to configure the settings.


When to use full fine-tuning

Full SFT is not the default for a reason. QLoRA handles the vast majority of tasks at a fraction of the cost. Switch to full SFT only when you have evidence that QLoRA is insufficient.

| Signal | Try full SFT? | Why |
| --- | --- | --- |
| QLoRA metrics plateau after adding more data and epochs | Yes | The adapter may lack capacity to capture your domain |
| Images are very different from typical photos (microscopy, satellite, X-ray, spectrogram) | Yes | The base model vision encoder needs significant adjustment |
| You need absolute maximum accuracy and have the GPU budget | Yes | Full SFT gives the highest possible accuracy ceiling |
| First training run on a new task | No | Start with QLoRA to validate your approach at low cost |
| Limited GPU resources (single T4 or L4) | No | Full SFT on 7B+ models requires multi-GPU setups |
| Small dataset (under 200 images) | No | Full SFT with limited data risks overfitting; QLoRA is more stable |

Hardware requirements

Full SFT requires storing model weights, their gradients, and optimizer states (AdamW keeps two state tensors per parameter). Together these total roughly 12 bytes per parameter, compared to ~2 bytes per parameter for QLoRA; gradient checkpointing keeps the additional activation memory on top of this small.
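One common accounting behind the ~12-byte figure, assuming BF16 weights and gradients with FP32 AdamW states (an illustrative sketch, not the platform's exact memory model):

```python
def full_sft_bytes_per_param() -> int:
    """Rough per-parameter memory cost for full SFT with BF16 + AdamW."""
    weights = 2     # BF16 model weights
    gradients = 2   # BF16 gradients
    adam_m = 4      # AdamW first-moment state (FP32)
    adam_v = 4      # AdamW second-moment state (FP32)
    return weights + gradients + adam_m + adam_v

print(full_sft_bytes_per_param())  # 12
```

Activations come on top of this, which is why gradient checkpointing (covered below) matters for full SFT.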

Full SFT VRAM estimates

| Model | Estimated VRAM | Recommended GPU config |
| --- | --- | --- |
| Qwen 3B | ~12 GB | 1x A10 (24 GB) |
| Qwen 7B | ~28 GB | 1x A100 (80 GB) |
| InternVL 8B | ~32 GB | 1x A100 (80 GB) |
| Qwen 32B | ~128 GB | 2-4x A100 (80 GB) or 2x H100 |
| InternVL 38B | ~142 GB | 2-4x A100 (80 GB) or 2x H100 |

Models above 7B parameters generally require multiple GPUs for full SFT. Datature Vi supports up to 32 A100 GPUs or 64 H100 GPUs per training run. See GPU and Compute Resources Guide for tier details and multi-GPU scaling guidance.

Cost comparison

A 2-hour full SFT run on 4x A100 (80 GB) consumes 2,880 compute credits (120 min x 4 GPUs x 6.0 multiplier). The same task with QLoRA on 1x A10 might take 3 hours but consumes only 450 credits (180 min x 1 GPU x 2.5 multiplier). Full SFT costs over 6x more in this example. Make sure QLoRA is genuinely insufficient before committing.
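The credit arithmetic above can be checked with a small helper (the 6.0 and 2.5 tier multipliers are the example figures from this comparison):

```python
def credits(minutes: float, gpus: int, multiplier: float) -> float:
    """Compute credits = runtime (min) x GPU count x tier multiplier."""
    return minutes * gpus * multiplier

full_sft = credits(120, 4, 6.0)  # 2-hour run on 4x A100
qlora = credits(180, 1, 2.5)     # 3-hour run on 1x A10

print(full_sft, qlora)           # 2880.0 450.0
print(full_sft / qlora)          # 6.4x cost ratio
```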


Configure full SFT settings

In the workflow canvas, click the Model node and set training mode to Full fine-tuning. The following settings control how full SFT behaves.

Precision options

Precision determines the numeric format for weights and gradients during training.

| Format | Memory per param | Training stability | When to use |
| --- | --- | --- | --- |
| BFloat16 | 2 bytes | Best on modern GPUs (Ampere+) | Default for all A100/H100 runs |
| Float16 | 2 bytes | Good, slightly less stable than BF16 | Older GPUs without BFloat16 support |
| Float32 | 4 bytes | Highest precision | Debugging numerical issues only |

BFloat16 is the recommended default. It provides the same memory savings as Float16 with a wider exponent range, which prevents the gradient underflow issues that can occur with Float16 on very large models.
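The difference in exponent range can be computed directly from the bit layouts (5 exponent bits and 10 mantissa bits for Float16; 8 exponent bits and 7 mantissa bits for BFloat16, matching Float32's range):

```python
# Largest finite value in each 16-bit format, from its bit layout:
# max = (2 - 2^-mantissa_bits) * 2^max_exponent
fp16_max = (2 - 2**-10) * 2**15    # Float16
bf16_max = (2 - 2**-7) * 2**127    # BFloat16

print(fp16_max)           # 65504.0
print(f"{bf16_max:.2e}")  # 3.39e+38
```

Gradients larger than ~65,504 overflow in Float16, and very small ones underflow to zero; BFloat16's Float32-sized exponent avoids both at the cost of fewer mantissa bits.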

Gradient checkpointing

Gradient checkpointing trades compute time for memory savings. Instead of storing all intermediate activations during the forward pass (which consume significant VRAM), the model recomputes them during the backward pass.

Datature Vi enables gradient checkpointing by default for full SFT. This reduces activation memory by 60-70% at the cost of roughly 20-30% slower training. For multi-GPU runs on large models, this trade-off is almost always worth it.
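As an illustration, applying the midpoints of the documented ranges (65% activation memory saved, 25% slower steps) to a hypothetical run:

```python
def checkpointing_tradeoff(activation_gb: float, step_time_s: float):
    """Illustrative midpoints of the documented 60-70% memory
    savings and 20-30% slowdown for gradient checkpointing."""
    remaining_gb = activation_gb * (1 - 0.65)  # activations still stored
    new_step_time = step_time_s * 1.25         # recomputation overhead
    return remaining_gb, new_step_time

mem, t = checkpointing_tradeoff(activation_gb=40.0, step_time_s=2.0)
print(round(mem, 2), round(t, 2))  # 14.0 2.5
```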

Multi-GPU configuration

When you select multiple GPUs in the hardware selection step, Datature Vi automatically configures distributed training:

  • Data parallelism splits each batch across GPUs. Each GPU processes a portion of the batch and synchronizes gradients after each step.
  • Model parallelism splits the model itself across GPUs when a single GPU cannot hold all parameters. Datature Vi selects the right parallelism strategy (FSDP or DeepSpeed ZeRO-3) based on your model size and GPU count.
  • NVLink is used for gradient synchronization on H100 GPUs, providing higher bandwidth than standard PCIe connections.

You do not need to configure parallelism settings manually. Select the GPU type and count, and the platform handles the rest.
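Conceptually, data-parallel gradient synchronization reduces to averaging per-shard gradients. A minimal pure-Python sketch (not the platform's actual FSDP/DeepSpeed implementation) shows that the averaged gradient matches the full-batch gradient:

```python
def grad(w, xs, ys):
    """Gradient of mean squared error for y ~ w*x over a batch."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Data parallelism: split the batch across two "GPUs", compute
# per-shard gradients, then average them (the all-reduce step).
g0 = grad(w, xs[:2], ys[:2])
g1 = grad(w, xs[2:], ys[2:])
synced = (g0 + g1) / 2

print(synced, grad(w, xs, ys))  # -22.5 -22.5 (identical)
```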

To estimate the GPU count yourself, divide your estimated VRAM by the per-GPU VRAM, round up, and add headroom for batch data and activations.

Example: Qwen 32B full SFT needs ~128 GB. An A100 has 80 GB. Two A100s provide 160 GB total, giving 32 GB of headroom for batch data. This works, but 4x A100 gives more room for larger batch sizes and faster training.
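The sizing rule can be sketched as a small helper; the 16 GB headroom default is an assumption for illustration, not a platform setting:

```python
import math

def gpus_needed(model_vram_gb: float, per_gpu_gb: float,
                headroom_gb: float = 16.0) -> int:
    """Round up to the GPU count that fits the model plus headroom
    for batch data and activations (headroom value is illustrative)."""
    return math.ceil((model_vram_gb + headroom_gb) / per_gpu_gb)

print(gpus_needed(128, 80))  # Qwen 32B on A100 (80 GB) -> 2
print(gpus_needed(142, 80))  # InternVL 38B on A100 -> 2
```

Larger batch sizes need more headroom, which is why stepping up to 4x A100 can still pay off even when 2x fits.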

For H100 GPUs with NVLink, the higher inter-GPU bandwidth means gradient synchronization is faster, making multi-GPU scaling more efficient. Consider H100 for runs with 8+ GPUs.


Monitor full SFT training

Full SFT runs take longer and consume more resources than QLoRA. Pay close attention to these signals during training:

| Signal | What it means | Action |
| --- | --- | --- |
| Training loss decreases steadily | Normal training progress | Continue |
| Validation loss stops improving for 3+ epochs | Early signs of overfitting | Consider stopping the run and using the best checkpoint so far |
| Validation loss rises while training loss falls | Overfitting confirmed | Stop the run. Use the checkpoint from before validation loss started rising |
| Loss spikes or oscillates | Learning rate too high | Reduce learning rate by half for the next run |
| GPU utilization drops below 50% | Possible data loading bottleneck | Check dataset size and batch settings |

Full SFT is more prone to overfitting than QLoRA because all parameters are free to change. With small datasets (under 500 images), watch validation loss closely and stop early if it diverges from training loss.
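The "stop when validation loss stops improving for 3+ epochs" rule amounts to a simple patience check. A minimal sketch (illustrative, not the platform's early-stopping implementation):

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once validation loss
    has not improved for `patience` consecutive epochs."""
    best_epoch, best = 0, float("inf")
    since_improved = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best_epoch, best = epoch, loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # diverging: keep the best checkpoint seen so far
    return best_epoch, best

# Validation loss improves, then rises -> keep epoch 3's checkpoint.
print(best_checkpoint([0.9, 0.7, 0.6, 0.55, 0.58, 0.61, 0.66]))  # (3, 0.55)
```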

For detailed metric interpretation, see Training Metrics. For per-image prediction analysis, see Advanced Evaluation. For log-level troubleshooting, see Training Logs.


Compare full SFT vs QLoRA results

After running both methods on the same dataset, compare them on your evaluation metrics (not just loss).

| Factor | QLoRA advantage | Full SFT advantage |
| --- | --- | --- |
| Training cost | 5-10x cheaper compute credits | N/A |
| Training speed | 2-3x faster wall-clock time | N/A |
| Accuracy ceiling | Within 1-3% for most tasks | Highest possible for novel domains |
| Overfitting risk | Lower (fewer trainable params) | Higher (all params updated) |
| Domain adaptation | Good for in-distribution tasks | Better for out-of-distribution images |

The decision is task-specific. If QLoRA produces BERTScore F1 of 0.88 and full SFT produces 0.90, the 2-point difference may not justify 6x the cost. If QLoRA produces 0.72 and full SFT produces 0.85, the gap is worth it.
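One way to frame the decision is accuracy gain per unit of extra cost. This is a hypothetical rule of thumb with an assumed threshold, not a platform feature:

```python
def worth_full_sft(qlora_score: float, full_score: float,
                   cost_multiplier: float,
                   min_gain_per_x: float = 0.01) -> bool:
    """Hypothetical heuristic: require at least `min_gain_per_x` metric
    improvement per 1x of extra cost (threshold is an assumption)."""
    gain = full_score - qlora_score
    return gain >= min_gain_per_x * cost_multiplier

print(worth_full_sft(0.88, 0.90, 6.0))  # False: 0.02 gain for 6x cost
print(worth_full_sft(0.72, 0.85, 6.0))  # True: 0.13 gain justifies it
```

Tune the threshold to your own accuracy requirements and credit budget.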

Run full SFT if ALL of these are true:

  1. You tried QLoRA with rank 16-32 and results are below your target.
  2. You tried adding more training data and it did not close the gap.
  3. You have multi-GPU resources (A100 or H100) available.
  4. Your dataset has at least 500 annotated images (to avoid overfitting).

If any of these are false, stick with QLoRA and focus on improving your data quality instead.


Frequently asked questions

Can I combine NF4/FP4 quantization with full fine-tuning?

Datature Vi does not combine NF4/FP4 quantization with full fine-tuning. Quantization freezes weights at reduced precision, which conflicts with full SFT's goal of updating every parameter. If you need lower memory, use QLoRA instead. See the QLoRA Training Guide.

Do any models require full SFT?

Yes. NVILA-Lite does not support LoRA. All other architectures in Datature Vi support both LoRA and full SFT. See Model Architectures for the full compatibility matrix.

How do I reduce overfitting during full SFT?

Three strategies, in order of impact: (1) use more training data with diverse examples, (2) reduce the number of epochs so the model sees the data fewer times, (3) monitor validation loss and stop training when it stops improving. Full SFT on small datasets (under 200 images) is especially prone to overfitting.

Can I start with QLoRA and switch to full SFT later?

Yes. Train with QLoRA first to validate your dataset, system prompt, and general approach. Once you are satisfied that the setup is correct, create a new workflow with full SFT to push accuracy further. The two workflows run independently and you can compare their results in the evaluation dashboard.

Which optimizer does full SFT use?

Datature Vi uses AdamW by default for full SFT. AdamW maintains per-parameter learning rate adjustments and is the standard optimizer for transformer fine-tuning. You can change the optimizer in Model Settings, but AdamW is recommended for most cases.


Next steps

QLoRA Training Guide

Memory-efficient alternative with LoRA rank, alpha, and NF4 settings.

Start a Training Run

Select GPU hardware and launch your configured workflow.

Improve Your Model

Diagnose weak results and iterate on data and settings.