QLoRA Training Guide

Understand how QLoRA works in Datature Vi and configure quantization settings for memory-efficient VLM fine-tuning.

Before you start

QLoRA combines LoRA (Low-Rank Adaptation) with NF4 quantization. The base model weights are stored in 4-bit precision to save memory, while small adapter matrices train in BF16 for full gradient precision. You get the trainable-parameter savings of LoRA and the weight-storage savings of 4-bit quantization at once, letting you fine-tune a 7B model on a single T4 GPU (16 GB VRAM).
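The weight-storage saving is easy to check with back-of-the-envelope arithmetic. The sketch below covers base weights only; real usage adds optimizer states, activations, and small per-block quantization constants, so treat the numbers as illustrative.

```python
# Back-of-the-envelope weight-storage math for a 7B-parameter model.
# Illustrative only -- real usage adds optimizer states, activations,
# and per-block quantization constants.
PARAMS = 7e9

fp16_gb = PARAMS * 2 / 1024**3    # 2 bytes per weight
nf4_gb = PARAMS * 0.5 / 1024**3   # 4 bits = 0.5 bytes per weight

print(f"FP16 weights: {fp16_gb:.1f} GB")  # ~13.0 GB
print(f"NF4 weights:  {nf4_gb:.1f} GB")   # ~3.3 GB
```

The roughly 4x reduction in base-weight storage is what frees enough VRAM for the BF16 adapters, optimizer states, and activations on a 16 GB card.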

Datature Vi handles LoRA rank, alpha, target modules, and learning rate scheduling automatically. The one setting you control is the quantization format: NF4 or FP4.


When to use QLoRA

QLoRA is the recommended default for most Datature Vi training runs. It covers the widest range of model sizes on the most affordable hardware.

| Method | Trainable params | Min VRAM (7B) | Training speed | Best for |
|---|---|---|---|---|
| QLoRA (LoRA + NF4) | 0.1-1% | ~7 GB | Fastest | Most tasks, limited VRAM, prototyping |
| LoRA (FP16) | 0.1-1% | ~12 GB | Fast | When NF4 quality loss is unacceptable |
| Full SFT | 100% | ~28 GB | Slowest | Maximum accuracy with sufficient data and compute |

Start with QLoRA. Move to LoRA without quantization only if you measure quality degradation on your specific task. Move to full SFT only if LoRA results plateau and you have the GPU budget. See the Full SFT Training Guide for that path.


Configure quantization

In the workflow canvas, click the Model node. Under Quantization, select the format:

  • NF4 (Normalized Float 4) -- the recommended default. NF4 maps its 16 possible values to match the bell-curve distribution of transformer weights, preserving more information where it matters.
  • FP4 (4-bit Floating Point) -- a standard 4-bit format with uniform distribution. Use as a fallback if NF4 causes numerical stability issues with a specific model architecture.

Both formats reduce memory by roughly 4x compared to FP16. With either enabled and LoRA selected as training mode, Datature Vi automatically configures QLoRA: base weights in 4-bit, adapter matrices in BF16.
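The intuition behind NF4's advantage can be sketched with the standard library. The level sets below are illustrative, not the exact NF4 or FP4 codebooks: the point is only that quantile-placed levels cluster where bell-curve-distributed weights cluster.

```python
from statistics import NormalDist

# Sketch of the NF4 idea: place the 16 representable values at quantiles
# of a standard normal, so levels are densest where transformer weights
# are densest (near zero). Illustrative levels, not the real codebooks.
nd = NormalDist()
nf4_like = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
fp4_like_uniform = [-1 + 2 * i / 15 for i in range(16)]  # evenly spaced

central_gap = nf4_like[8] - nf4_like[7]   # spacing near zero
edge_gap = nf4_like[1] - nf4_like[0]      # spacing in the tail

print(f"central gap: {central_gap:.3f}, edge gap: {edge_gap:.3f}")
```

Quantile-based levels are several times denser near zero than in the tails, while evenly spaced levels spend the same resolution everywhere, including ranges where few weights actually fall.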

See LoRA and Quantization for a deeper explanation of NF4 vs FP4 and when quantization affects quality.

Disabling quantization

To run standard LoRA without quantization, disable the quantization toggle in the Model node. This doubles the memory needed for base weights (FP16 instead of 4-bit) but preserves full weight precision. Only do this if you confirm through evaluation that NF4 causes measurable quality loss on your task.


How QLoRA works under the hood

Datature Vi configures LoRA internals automatically based on your selected model architecture and size. Understanding what happens behind the scenes helps you interpret training behavior and troubleshoot issues.

Adapter rank and alpha

LoRA inserts small trainable matrices (adapters) into specific model layers. The rank controls adapter capacity: how many dimensions the adapter uses to capture fine-tuning changes. Higher ranks give more capacity but use more memory. The alpha parameter scales the adapter's contribution to the final output.

Datature Vi selects rank and alpha values tuned for each architecture and model size. You do not need to configure these manually.

A rank of 8-32 works well for most VLM tasks. Lower ranks (4-8) are sufficient for simple tasks like binary classification or short-answer VQA. Higher ranks (32-64) help with complex tasks that need fine-grained domain adaptation, such as detailed medical image captioning.

The effective scaling factor is alpha / rank. A common pattern is alpha = 2x rank. This balance keeps the adapter's influence proportional to its capacity.
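The adapter math above can be sketched in a few lines of pure Python. The dimensions, weight values, and rank/alpha choices below are illustrative, not the values Datature Vi selects.

```python
# Minimal LoRA sketch on one linear layer:
#   h = W x + (alpha / rank) * B (A x)
# Only A and B would be trained; W stays frozen. Values are illustrative.
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d_out, d_in, rank, alpha = 4, 4, 2, 4      # alpha = 2x rank
scaling = alpha / rank                     # effective scaling factor: 2.0

W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1] * d_in for _ in range(rank)]    # rank x d_in (trainable)
B = [[0.1] * rank for _ in range(d_out)]   # d_out x rank (trainable)

x = [1.0, 2.0, 3.0, 4.0]
h = [w + scaling * b
     for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(h)
```

Because A is rank x d_in and B is d_out x rank, the adapter adds only rank * (d_in + d_out) trainable values per layer, which is how LoRA stays in the 0.1-1% trainable-parameter range.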

Target modules

LoRA adapters are inserted into specific transformer layers. The default targets are the attention projection layers (q_proj, k_proj, v_proj, o_proj), which control how the model relates different parts of image and text inputs. For some architectures, Datature Vi also targets feed-forward layers (gate_proj, up_proj, down_proj) when the model size and task complexity warrant it.

Learning rate scheduling

Datature Vi uses cosine annealing with warmup as the default learning rate schedule. The learning rate rises from near-zero during a warmup phase, then smoothly decays following a cosine curve. This gives the model time to make large updates early (when there is the most to learn) and small, precise updates later (when fine-tuning is finishing).

The learning rate range, warmup duration, and schedule type are set automatically based on the model architecture and training mode.
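The schedule shape is easy to see in code. The peak learning rate, warmup length, and step counts below are hypothetical numbers for illustration; Datature Vi sets the real values automatically.

```python
import math

# Cosine annealing with warmup. All values here are hypothetical
# illustrations -- Datature Vi picks the real ones per architecture.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=2e-4, min_lr=0.0):
    if step < warmup_steps:                # linear ramp up from near zero
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # smooth cosine decay from peak_lr down to min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(f"start {lr_at(0):.2e}, peak {lr_at(99):.2e}, "
      f"midway {lr_at(550):.2e}, end {lr_at(999):.2e}")
```

The curve rises linearly to the peak by the end of warmup, sits at half the peak midway through the decay, and approaches the minimum at the final step.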


VRAM requirements

QLoRA VRAM estimates

| Model | Quantization | Estimated VRAM | Recommended GPU |
|---|---|---|---|
| Qwen 3B | NF4 | ~4 GB | T4 (16 GB) |
| Qwen 7B | NF4 | ~7 GB | T4 (16 GB) |
| Qwen 7B | FP16 (no quantization) | ~12 GB | T4 or L4 |
| InternVL 8B | FP16 (no quantization) | ~14 GB | T4 or L4 |
| Qwen 32B | NF4 | ~18 GB | L4 or A10 |
| Qwen 32B | FP16 (no quantization) | ~48 GB | A100 (80 GB) |
| Qwen 72B | NF4 | ~36 GB | A100 (80 GB) |

These estimates include model weights, adapter parameters, optimizer states, and activation memory with gradient checkpointing enabled. Actual usage varies by 10-15% depending on batch size and sequence length. Leave 20-30% headroom above the estimate when selecting your GPU.
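A quick way to apply the headroom rule is to inflate the table estimate before comparing it to a card's VRAM. The helper below is a simple sketch of that check, using the table's own numbers.

```python
# Headroom check when picking a GPU: inflate the table estimate by
# 20-30% before comparing against the card's VRAM.
def fits(estimated_gb, gpu_vram_gb, headroom=0.25):
    return estimated_gb * (1 + headroom) <= gpu_vram_gb

print(fits(7, 16))   # Qwen 7B NF4 on a T4 (16 GB): True
print(fits(18, 16))  # Qwen 32B NF4 on a T4 (16 GB): False
```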

For a detailed VRAM estimation formula and full GPU comparison, see GPU and Compute Resources Guide.


Troubleshooting QLoRA training

Out-of-memory errors

The model and batch data exceed your GPU's VRAM. Fix it in this order:

  1. Reduce batch size to 1 or 2. Use gradient accumulation to maintain effective batch size.
  2. Confirm NF4 is enabled. Without quantization, memory usage roughly doubles.
  3. Try a smaller model architecture. Dropping from 7B to 3B cuts memory significantly.
  4. Switch to a GPU with more VRAM. See the GPU tiers table.
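Step 1 above works because gradient accumulation trades per-step memory for extra forward passes. A quick sketch of the arithmetic, with an illustrative target batch size:

```python
# Keep the effective batch size constant while shrinking the per-step
# (micro) batch. Target batch of 16 is an illustrative example.
def accumulation_steps(target_effective_batch, micro_batch):
    # ceil division so the effective batch never drops below the target
    return -(-target_effective_batch // micro_batch)

target = 16
for micro in (8, 4, 2, 1):
    steps = accumulation_steps(target, micro)
    print(f"micro batch {micro} x {steps} accumulation steps "
          f"= effective batch {micro * steps}")
```

Gradients are summed across the accumulation steps before each optimizer update, so training dynamics stay close to the larger batch while peak activation memory scales with the micro batch.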

Loss spikes

Sudden jumps in training loss usually indicate instability in the learning process. Try:

  1. Lower the learning rate. Reduce it by half in Model Settings.
  2. Reduce batch size. Smaller batches can stabilize gradient updates.
  3. Check your annotations. Inconsistent labels send conflicting signals that destabilize training.

Quality degradation

NF4 quantization causes minor quality loss (typically under 0.5%). If you see a meaningful drop:

  1. Try FP4 instead of NF4. Some model architectures respond differently to each format.
  2. Disable quantization and run LoRA in FP16. This uses more memory but preserves full weight precision.
  3. Add more training data. More diverse examples help the model learn robust patterns despite lower weight precision.
  4. Try a larger model. Larger models have more redundancy and absorb quantization noise better.

Slow convergence

If loss decreases very gradually over many epochs:

  1. Increase the learning rate. Try doubling it in Model Settings.
  2. Add more epochs. The model may need more passes through your data.
  3. Check your annotations. Inconsistent or low-quality annotations slow convergence because the model receives conflicting signals.

Frequently asked questions

What is the difference between LoRA and QLoRA?

LoRA inserts small trainable adapter matrices into a frozen base model. QLoRA does the same thing but also quantizes the frozen base weights to NF4 (4-bit), cutting memory by roughly 4x. In Datature Vi, enabling LoRA with NF4 quantization automatically uses QLoRA. There is no separate toggle.

Can I configure LoRA rank, alpha, or target modules myself?

Not directly. Datature Vi selects rank, alpha, and target modules automatically based on your model architecture and size. These defaults are tuned for each supported architecture. If you need more control, the Vi SDK allows programmatic configuration of training flows.

Which model architectures support QLoRA?

QLoRA works with all architectures in Datature Vi except NVILA-Lite, which only supports full fine-tuning. See Model Architectures for each architecture's supported training modes.

Should I keep NF4 quantization enabled?

For most tasks, yes. NF4 provides significant memory savings with minimal quality impact. Disable NF4 only if you confirm through evaluation that FP16 LoRA produces measurably better results on your specific task and you have the GPU resources for it.

Should I choose NF4 or FP4?

Start with NF4. It is designed for transformer weight distributions and preserves more information than FP4 for VLMs. Switch to FP4 only if you encounter numerical stability issues (loss spikes, NaN values) that do not resolve with other troubleshooting steps.


Next steps

Full SFT Training Guide

When and how to use full fine-tuning for maximum accuracy.

Start a Training Run

Select GPU hardware and launch your configured workflow.

Training Metrics

Read loss curves, F1, IoU, BLEU, and BERTScore on the run dashboard.