Full SFT Training Guide
Configure full supervised fine-tuning (SFT) for maximum VLM accuracy in Datature Vi. Understand hardware requirements, multi-GPU setup, precision options, and when full SFT is worth the cost.
Prerequisites
- A workflow open in the workflow canvas with a model architecture selected
- Understanding of LoRA and quantization (you should have tried QLoRA first)
- Sufficient compute credits for multi-GPU training
- Familiarity with GPU tiers and VRAM requirements
Full supervised fine-tuning (SFT) updates every parameter in the model. This gives the training process maximum freedom to adapt the model to your domain, which can matter when your images look very different from the data the base model was trained on. The trade-off is cost: full SFT needs 5-10x more GPU memory and runs 2-3x slower than QLoRA.
This page explains when the extra cost is justified, what hardware you need, and how to configure the settings.
When to use full fine-tuning
Full SFT is not the default for a reason: QLoRA handles the vast majority of tasks at a fraction of the cost. Switch to full SFT only when you have evidence that QLoRA is insufficient, for example validation metrics that plateau even after raising the LoRA rank or adding training data.
Hardware requirements
Full SFT requires storing the model weights, their gradients, and optimizer states (AdamW keeps two state tensors per parameter). With BFloat16 weights and gradients, this totals roughly 12 bytes per parameter: 2 for weights, 2 for gradients, and 8 for the two AdamW states kept in full precision. QLoRA, by comparison, needs only ~2 bytes per parameter. Gradient checkpointing (covered below) reduces activation memory but does not change this per-parameter total.
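The per-parameter arithmetic above can be sketched as a quick estimator. This is illustrative only, and `full_sft_vram_gb` is a hypothetical helper, not a platform API; it counts parameter state (weights, gradients, optimizer) and ignores activations.

```python
def full_sft_vram_gb(n_params: float, bytes_per_param: float = 12.0) -> float:
    """Rough VRAM needed for parameter state alone.

    12 bytes/param assumes BF16 weights (2) + BF16 gradients (2)
    + two FP32 AdamW state tensors (8). Activations are extra.
    """
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model as an example:
print(full_sft_vram_gb(7e9))       # 84.0 GB of parameter state for full SFT
print(full_sft_vram_gb(7e9, 2.0))  # 14.0 GB at QLoRA's ~2 bytes/param
```

At 84 GB of parameter state before activations, a 7B model already exceeds a single 80 GB GPU, which is why larger models need multi-GPU runs.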
Models above 7B parameters generally require multiple GPUs for full SFT. Datature Vi supports up to 32 A100 GPUs or 64 H100 GPUs per training run. See GPU and Compute Resources Guide for tier details and multi-GPU scaling guidance.
A 2-hour full SFT run on 4x A100 (80 GB) consumes 2,880 compute credits (120 min x 4 GPUs x 6.0 multiplier). The same task with QLoRA on 1x A10 might take 3 hours but consumes only 450 credits (180 min x 1 GPU x 2.5 multiplier). Full SFT costs over 6x more in this example (2,880 / 450 ≈ 6.4). Make sure QLoRA is genuinely insufficient before committing.
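The credit arithmetic above follows a simple formula that you can reproduce when budgeting a run. The helper below is a sketch for estimation, not a platform API; the multipliers are the ones quoted in this example.

```python
def run_credits(minutes: float, gpus: int, multiplier: float) -> float:
    """Compute credits = wall-clock minutes x GPU count x tier multiplier."""
    return minutes * gpus * multiplier

full_sft = run_credits(120, 4, 6.0)  # 4x A100, 2 hours -> 2880.0
qlora = run_credits(180, 1, 2.5)     # 1x A10, 3 hours  -> 450.0
print(full_sft / qlora)              # 6.4x cost ratio
```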
Configure full SFT settings
In the workflow canvas, click the Model node and set training mode to Full fine-tuning. The following settings control how full SFT behaves.
Precision options
Precision determines the numeric format for weights and gradients during training.
BFloat16 is the recommended default. It provides the same memory savings as Float16 with a wider exponent range, which prevents the gradient underflow issues that can occur with Float16 on very large models.
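The underflow risk is easy to demonstrate numerically. The sketch below uses NumPy's `float16` to show a small gradient vanishing; since stock NumPy has no bfloat16 dtype, `float32` (which shares BFloat16's 8-bit exponent) stands in to show the value surviving.

```python
import numpy as np

# Float16 has a 5-bit exponent: magnitudes below ~6e-8 underflow to zero.
tiny_grad = 1e-8
print(float(np.float16(tiny_grad)))  # 0.0 -- the gradient is lost

# BFloat16 shares Float32's 8-bit exponent (min normal ~1.2e-38),
# so the same gradient keeps its magnitude.
print(float(np.float32(tiny_grad)) > 0.0)  # True
```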
Gradient checkpointing
Gradient checkpointing trades compute time for memory savings. Instead of storing all intermediate activations during the forward pass (which consume significant VRAM), the model recomputes them during the backward pass.
Datature Vi enables gradient checkpointing by default for full SFT. This reduces activation memory by 60-70% at the cost of roughly 20-30% slower training. For multi-GPU runs on large models, this trade-off is almost always worth it.
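The trade-off can be estimated from the documented ranges. This is a back-of-envelope sketch using the midpoints of the 60-70% memory saving and 20-30% slowdown figures; the function name and inputs are illustrative.

```python
def checkpointing_tradeoff(act_gb: float, step_min: float,
                           mem_saving: float = 0.65, slowdown: float = 0.25):
    """Estimate activation memory and step time with checkpointing on,
    using midpoints of the documented ranges as default assumptions."""
    return act_gb * (1 - mem_saving), step_min * (1 + slowdown)

mem, step_time = checkpointing_tradeoff(act_gb=40.0, step_min=1.0)
print(round(mem, 2), step_time)  # ~14.0 GB of activations, 1.25 min/step
```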
Multi-GPU configuration
When you select multiple GPUs in the hardware selection step, Datature Vi automatically configures distributed training:
- Data parallelism splits each batch across GPUs. Each GPU processes a portion of the batch and synchronizes gradients after each step.
- Model parallelism splits the model itself across GPUs when a single GPU cannot hold all parameters. Datature Vi selects the right parallelism strategy (FSDP or DeepSpeed ZeRO-3) based on your model size and GPU count.
- NVLink is used for gradient synchronization on H100 GPUs, providing higher bandwidth than standard PCIe connections.
You do not need to configure parallelism settings manually. Select the GPU type and count, and the platform handles the rest.
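Although the platform configures parallelism for you, the data-parallel step above is worth understanding. The toy sketch below simulates it in plain Python: shard the batch, compute a per-shard "gradient", then average across workers (the all-reduce). The `grad_fn` stand-in replaces a real backward pass.

```python
def simulate_data_parallel(batch, n_gpus, grad_fn):
    """Toy data parallelism: shard the batch round-robin, compute
    per-shard gradients, then average them (the all-reduce step)."""
    shards = [batch[i::n_gpus] for i in range(n_gpus)]
    per_gpu = [grad_fn(s) for s in shards]
    return sum(per_gpu) / n_gpus  # every worker ends up with this value

# Toy gradient: the mean of the shard.
grad = simulate_data_parallel([1.0, 2.0, 3.0, 4.0], n_gpus=2,
                              grad_fn=lambda s: sum(s) / len(s))
print(grad)  # 2.5
```

After the averaging step, every worker holds identical gradients, which is what keeps model replicas in sync across steps.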
Monitor full SFT training
Full SFT runs take longer and consume more resources than QLoRA, so monitor these runs closely.
Full SFT is more prone to overfitting than QLoRA because all parameters are free to change. With small datasets (under 500 images), watch validation loss closely and stop early if it diverges from training loss.
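The "stop early if validation loss diverges" rule can be expressed as a simple check. This is an illustrative sketch, not platform behavior; `should_stop` and its patience threshold are assumptions you would tune.

```python
def should_stop(train_losses, val_losses, patience=3, tol=0.0):
    """Flag overfitting: validation loss has not improved for
    `patience` epochs while training loss is still falling."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    still_learning = train_losses[-1] < train_losses[-patience - 1]
    return recent_best >= best_before - tol and still_learning

train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.35]
val = [1.1, 0.9, 0.85, 0.9, 0.95, 1.0]
print(should_stop(train, val))  # True: val diverges while train keeps falling
```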
For detailed metric interpretation, see Training Metrics. For per-image prediction analysis, see Advanced Evaluation. For log-level troubleshooting, see Training Logs.
Compare full SFT vs QLoRA results
After running both methods on the same dataset, compare them on your evaluation metrics (not just loss).
The decision is task-specific. If QLoRA produces BERTScore F1 of 0.88 and full SFT produces 0.90, the 2-point difference may not justify 6x the cost. If QLoRA produces 0.72 and full SFT produces 0.85, the gap is worth it.
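One way to make this judgment explicit is to require the metric gain to scale with the cost multiple. The heuristic and threshold below are assumptions for illustration, not platform guidance; the scores reuse the BERTScore examples above.

```python
def full_sft_worth_it(qlora_score, full_score, cost_ratio,
                      min_gain_per_cost=0.01):
    """Illustrative heuristic (assumed threshold): demand at least
    0.01 metric points of gain per 1x of extra cost."""
    gain = full_score - qlora_score
    return gain >= min_gain_per_cost * cost_ratio

print(full_sft_worth_it(0.88, 0.90, cost_ratio=6))  # marginal gain
print(full_sft_worth_it(0.72, 0.85, cost_ratio=6))  # clear win
```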
