Full SFT Training Guide

Configure full supervised fine-tuning (SFT) for maximum VLM accuracy in Datature Vi. Understand hardware requirements, multi-GPU setup, precision options, and when full SFT is worth the cost.

Before you start

Full supervised fine-tuning (SFT) updates every parameter in the model. This gives the training process maximum freedom to adapt the model to your domain, which can matter when your images look very different from the data the base model was trained on. The trade-off is cost: full SFT needs 5-10x more GPU memory and runs 2-3x slower than QLoRA.

This page explains when the extra cost is justified, what hardware you need, and how to configure the settings.


When to use full fine-tuning

Full SFT is not the default for a reason. QLoRA handles the vast majority of tasks at a fraction of the cost. Switch to full SFT only when you have evidence that QLoRA is insufficient.

| Signal | Try full SFT? | Why |
| --- | --- | --- |
| QLoRA metrics plateau after adding more data and epochs | Yes | The adapter may lack capacity to capture your domain |
| Images are very different from typical photos (microscopy, satellite, X-ray, spectrogram) | Yes | The base model vision encoder needs significant adjustment |
| You need absolute maximum accuracy and have the GPU budget | Yes | Full SFT gives the highest possible accuracy ceiling |
| First training run on a new task | No | Start with QLoRA to validate your approach at low cost |
| Limited GPU resources (single T4 or L4) | No | Full SFT on 7B+ models requires multi-GPU setups |
| Small dataset (under 200 images) | No | Full SFT with limited data risks overfitting; QLoRA is more stable |

Hardware requirements

Full SFT requires storing model weights, their gradients, and optimizer states (AdamW keeps two state tensors per parameter). Together these total roughly 12 bytes per parameter, compared to ~2 bytes per parameter for QLoRA; gradient checkpointing keeps the additional activation memory on top of this small.
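One common accounting behind the ~12-byte figure, assuming BF16 weights and gradients with FP32 AdamW states (an illustrative sketch, not the platform's exact memory model):

```python
def full_sft_bytes_per_param() -> int:
    """Rough per-parameter memory cost for full SFT with BF16 + AdamW."""
    weights = 2     # BF16 model weights
    gradients = 2   # BF16 gradients
    adam_m = 4      # AdamW first-moment state (FP32)
    adam_v = 4      # AdamW second-moment state (FP32)
    return weights + gradients + adam_m + adam_v

print(full_sft_bytes_per_param())  # 12
```

Activations come on top of this, which is why gradient checkpointing (covered below) matters for full SFT.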

Full SFT VRAM estimates

| Model | Estimated VRAM | Recommended GPU config |
| --- | --- | --- |
| Qwen 3B | ~12 GB | 1x A10 (24 GB) |
| Qwen 7B | ~28 GB | 1x A100 (80 GB) |
| InternVL 8B | ~32 GB | 1x A100 (80 GB) |
| Qwen 32B | ~128 GB | 2-4x A100 (80 GB) or 2x H100 |
| InternVL 38B | ~142 GB | 2-4x A100 (80 GB) or 2x H100 |

Models above 7B parameters generally require multiple GPUs for full SFT. Datature Vi supports up to 32 A100 GPUs or 64 H100 GPUs per training run. See GPU and Compute Resources Guide for tier details and multi-GPU scaling guidance.

Cost comparison

A 2-hour full SFT run on 4x A100 (80 GB) consumes 2,880 compute credits (120 min x 4 GPUs x 6.0 multiplier). The same task with QLoRA on 1x A10 might take 3 hours but consumes only 450 credits (180 min x 1 GPU x 2.5 multiplier). Full SFT costs over 6x more in this example. Make sure QLoRA is genuinely insufficient before committing.
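The credit arithmetic above can be checked with a small helper (the 6.0 and 2.5 tier multipliers are the example figures from this comparison):

```python
def credits(minutes: float, gpus: int, multiplier: float) -> float:
    """Compute credits = runtime (min) x GPU count x tier multiplier."""
    return minutes * gpus * multiplier

full_sft = credits(120, 4, 6.0)  # 2-hour run on 4x A100
qlora = credits(180, 1, 2.5)     # 3-hour run on 1x A10

print(full_sft, qlora)           # 2880.0 450.0
print(full_sft / qlora)          # 6.4x cost ratio
```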


Configure full SFT settings

In the workflow canvas, click the Model node and set training mode to Full fine-tuning. The following settings control how full SFT behaves.

Precision options

Precision determines the numeric format for weights and gradients during training.

| Format | Memory per param | Training stability | When to use |
| --- | --- | --- | --- |
| BFloat16 | 2 bytes | Best on modern GPUs (Ampere+) | Default for all A100/H100 runs |
| Float16 | 2 bytes | Good, slightly less stable than BF16 | Older GPUs without BFloat16 support |
| Float32 | 4 bytes | Highest precision | Debugging numerical issues only |

BFloat16 is the recommended default. It provides the same memory savings as Float16 with a wider exponent range, which prevents the gradient underflow issues that can occur with Float16 on very large models.
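The difference in exponent range can be computed directly from the bit layouts (5 exponent bits and 10 mantissa bits for Float16; 8 exponent bits and 7 mantissa bits for BFloat16, matching Float32's range):

```python
# Largest finite value in each 16-bit format, from its bit layout:
# max = (2 - 2^-mantissa_bits) * 2^max_exponent
fp16_max = (2 - 2**-10) * 2**15    # Float16
bf16_max = (2 - 2**-7) * 2**127    # BFloat16

print(fp16_max)           # 65504.0
print(f"{bf16_max:.2e}")  # 3.39e+38
```

Gradients larger than ~65,504 overflow in Float16, and very small ones underflow to zero; BFloat16's Float32-sized exponent avoids both at the cost of fewer mantissa bits.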

Gradient checkpointing

Gradient checkpointing trades compute time for memory savings. Instead of storing all intermediate activations during the forward pass (which consume significant VRAM), the model recomputes them during the backward pass.

Datature Vi enables gradient checkpointing by default for full SFT. This reduces activation memory by 60-70% at the cost of roughly 20-30% slower training. For multi-GPU runs on large models, this trade-off is almost always worth it.
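As an illustration, applying the midpoints of the documented ranges (65% activation memory saved, 25% slower steps) to a hypothetical run:

```python
def checkpointing_tradeoff(activation_gb: float, step_time_s: float):
    """Illustrative midpoints of the documented 60-70% memory
    savings and 20-30% slowdown for gradient checkpointing."""
    remaining_gb = activation_gb * (1 - 0.65)  # activations still stored
    new_step_time = step_time_s * 1.25         # recomputation overhead
    return remaining_gb, new_step_time

mem, t = checkpointing_tradeoff(activation_gb=40.0, step_time_s=2.0)
print(round(mem, 2), round(t, 2))  # 14.0 2.5
```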

Multi-GPU configuration

When you select multiple GPUs in the hardware selection step, Datature Vi automatically configures distributed training:

  • Data parallelism splits each batch across GPUs. Each GPU processes a portion of the batch and synchronizes gradients after each step.
  • Model parallelism splits the model itself across GPUs when a single GPU cannot hold all parameters. Datature Vi selects the right parallelism strategy (FSDP or DeepSpeed ZeRO-3) based on your model size and GPU count.
  • NVLink is used for gradient synchronization on H100 GPUs, providing higher bandwidth than standard PCIe connections.

You do not need to configure parallelism settings manually. Select the GPU type and count, and the platform handles the rest.
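Conceptually, data-parallel gradient synchronization reduces to averaging per-shard gradients. A minimal pure-Python sketch (not the platform's actual FSDP/DeepSpeed implementation) shows that the averaged gradient matches the full-batch gradient:

```python
def grad(w, xs, ys):
    """Gradient of mean squared error for y ~ w*x over a batch."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Data parallelism: split the batch across two "GPUs", compute
# per-shard gradients, then average them (the all-reduce step).
g0 = grad(w, xs[:2], ys[:2])
g1 = grad(w, xs[2:], ys[2:])
synced = (g0 + g1) / 2

print(synced, grad(w, xs, ys))  # -22.5 -22.5 (identical)
```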

To estimate the GPU count yourself, divide your estimated VRAM by the per-GPU VRAM, round up, and add headroom for batch data and activations.

Example: Qwen 32B full SFT needs ~128 GB. An A100 has 80 GB. Two A100s provide 160 GB total, giving 32 GB of headroom for batch data. This works, but 4x A100 gives more room for larger batch sizes and faster training.
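The sizing rule can be sketched as a small helper; the 16 GB headroom default is an assumption for illustration, not a platform setting:

```python
import math

def gpus_needed(model_vram_gb: float, per_gpu_gb: float,
                headroom_gb: float = 16.0) -> int:
    """Round up to the GPU count that fits the model plus headroom
    for batch data and activations (headroom value is illustrative)."""
    return math.ceil((model_vram_gb + headroom_gb) / per_gpu_gb)

print(gpus_needed(128, 80))  # Qwen 32B on A100 (80 GB) -> 2
print(gpus_needed(142, 80))  # InternVL 38B on A100 -> 2
```

Larger batch sizes need more headroom, which is why stepping up to 4x A100 can still pay off even when 2x fits.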

For H100 GPUs with NVLink, the higher inter-GPU bandwidth means gradient synchronization is faster, making multi-GPU scaling more efficient. Consider H100 for runs with 8+ GPUs.


Monitor full SFT training

Full SFT runs take longer and consume more resources than QLoRA. Pay close attention to these signals during training:

| Signal | What it means | Action |
| --- | --- | --- |
| Training loss decreases steadily | Normal training progress | Continue |
| Validation loss stops improving for 3+ epochs | Early signs of overfitting | Consider stopping the run and using the best checkpoint so far |
| Validation loss rises while training loss falls | Overfitting confirmed | Stop the run. Use the checkpoint from before validation loss started rising |
| Loss spikes or oscillates | Learning rate too high | Reduce learning rate by half for the next run |
| GPU utilization drops below 50% | Possible data loading bottleneck | Check dataset size and batch settings |

Full SFT is more prone to overfitting than QLoRA because all parameters are free to change. With small datasets (under 500 images), watch validation loss closely and stop early if it diverges from training loss.
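The "stop when validation loss stops improving for 3+ epochs" rule amounts to a simple patience check. A minimal sketch (illustrative, not the platform's early-stopping implementation):

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once validation loss
    has not improved for `patience` consecutive epochs."""
    best_epoch, best = 0, float("inf")
    since_improved = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best_epoch, best = epoch, loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # diverging: keep the best checkpoint seen so far
    return best_epoch, best

# Validation loss improves, then rises -> keep epoch 3's checkpoint.
print(best_checkpoint([0.9, 0.7, 0.6, 0.55, 0.58, 0.61, 0.66]))  # (3, 0.55)
```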

For detailed metric interpretation, see Training Metrics. For per-image prediction analysis, see Advanced Evaluation. For log-level troubleshooting, see Training Logs.


Compare full SFT vs QLoRA results

After running both methods on the same dataset, compare them on your evaluation metrics (not just loss).

| Factor | QLoRA advantage | Full SFT advantage |
| --- | --- | --- |
| Training cost | 5-10x cheaper compute credits | N/A |
| Training speed | 2-3x faster wall-clock time | N/A |
| Accuracy ceiling | Within 1-3% for most tasks | Highest possible for novel domains |
| Overfitting risk | Lower (fewer trainable params) | Higher (all params updated) |
| Domain adaptation | Good for in-distribution tasks | Better for out-of-distribution images |

The decision is task-specific. If QLoRA produces BERTScore F1 of 0.88 and full SFT produces 0.90, the 2-point difference may not justify 6x the cost. If QLoRA produces 0.72 and full SFT produces 0.85, the gap is worth it.
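One way to frame the decision is accuracy gain per unit of extra cost. This is a hypothetical rule of thumb with an assumed threshold, not a platform feature:

```python
def worth_full_sft(qlora_score: float, full_score: float,
                   cost_multiplier: float,
                   min_gain_per_x: float = 0.01) -> bool:
    """Hypothetical heuristic: require at least `min_gain_per_x` metric
    improvement per 1x of extra cost (threshold is an assumption)."""
    gain = full_score - qlora_score
    return gain >= min_gain_per_x * cost_multiplier

print(worth_full_sft(0.88, 0.90, 6.0))  # False: 0.02 gain for 6x cost
print(worth_full_sft(0.72, 0.85, 6.0))  # True: 0.13 gain justifies it
```

Tune the threshold to your own accuracy requirements and credit budget.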

Run full SFT if ALL of these are true:

  1. You tried QLoRA with rank 16-32 and results are below your target.
  2. You tried adding more training data and it did not close the gap.
  3. You have multi-GPU resources (A100 or H100) available.
  4. Your dataset has at least 500 annotated images (to avoid overfitting).

If any of these are false, stick with QLoRA and focus on improving your data quality instead.


Frequently asked questions

Can I combine NF4/FP4 quantization with full fine-tuning?

Datature Vi does not combine NF4/FP4 quantization with full fine-tuning. Quantization freezes weights at reduced precision, which conflicts with full SFT's goal of updating every parameter. If you need lower memory, use QLoRA instead. See the QLoRA Training Guide.

Do any models require full SFT?

Yes. NVILA-Lite does not support LoRA. All other architectures in Datature Vi support both LoRA and full SFT. See Model Architectures for the full compatibility matrix.

How do I reduce overfitting during full SFT?

Three strategies, in order of impact: (1) use more training data with diverse examples, (2) reduce the number of epochs so the model sees the data fewer times, (3) monitor validation loss and stop training when it stops improving. Full SFT on small datasets (under 200 images) is especially prone to overfitting.

Can I start with QLoRA and switch to full SFT later?

Yes. Train with QLoRA first to validate your dataset, system prompt, and general approach. Once you are satisfied that the setup is correct, create a new workflow with full SFT to push accuracy further. The two workflows run independently and you can compare their results in the evaluation dashboard.

Which optimizer does full SFT use?

Datature Vi uses AdamW by default for full SFT. AdamW maintains per-parameter learning rate adjustments and is the standard optimizer for transformer fine-tuning. You can change the optimizer in Model Settings, but AdamW is recommended for most cases.


Next steps

QLoRA Training Guide

Memory-efficient alternative with LoRA rank, alpha, and NF4 settings.

Start a Training Run

Select GPU hardware and launch your configured workflow.

Improve Your Model

Diagnose weak results and iterate on data and settings.