QLoRA Training Guide

Understand how QLoRA works in Datature Vi and configure quantization settings for memory-efficient VLM fine-tuning.

Before you start

QLoRA combines LoRA (Low-Rank Adaptation) with NF4 quantization. The base model weights are stored in 4-bit precision to save memory, while small adapter matrices train in BF16 for full gradient precision. You get the trainable-parameter savings of LoRA and the weight-storage savings of 4-bit quantization at once, letting you fine-tune a 7B model on a single T4 GPU (16 GB VRAM).
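The weight-storage saving is easy to check with back-of-the-envelope arithmetic. The sketch below covers base weights only; real usage adds optimizer states, activations, and small per-block quantization constants, so treat the numbers as illustrative.

```python
# Back-of-the-envelope weight-storage math for a 7B-parameter model.
# Illustrative only -- real usage adds optimizer states, activations,
# and per-block quantization constants.
PARAMS = 7e9

fp16_gb = PARAMS * 2 / 1024**3    # 2 bytes per weight
nf4_gb = PARAMS * 0.5 / 1024**3   # 4 bits = 0.5 bytes per weight

print(f"FP16 weights: {fp16_gb:.1f} GB")  # ~13.0 GB
print(f"NF4 weights:  {nf4_gb:.1f} GB")   # ~3.3 GB
```

The roughly 4x reduction in base-weight storage is what frees enough VRAM for the BF16 adapters, optimizer states, and activations on a 16 GB card.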

Datature Vi handles LoRA rank, alpha, target modules, and learning rate scheduling automatically. The one setting you control is the quantization format: NF4 or FP4.


When to use QLoRA

QLoRA is the recommended default for most Datature Vi training runs. It covers the widest range of model sizes on the most affordable hardware.

| Method | Trainable params | Min VRAM (7B) | Training speed | Best for |
|---|---|---|---|---|
| QLoRA (LoRA + NF4) | 0.1-1% | ~7 GB | Fastest | Most tasks, limited VRAM, prototyping |
| LoRA (FP16) | 0.1-1% | ~12 GB | Fast | When NF4 quality loss is unacceptable |
| Full SFT | 100% | ~28 GB | Slowest | Maximum accuracy with sufficient data and compute |

Start with QLoRA. Move to LoRA without quantization only if you measure quality degradation on your specific task. Move to full SFT only if LoRA results plateau and you have the GPU budget. See the Full SFT Training Guide for that path.


Configure quantization

In the workflow canvas, click the Model node. Under Quantization, select the format:

  • NF4 (Normalized Float 4) -- the recommended default. NF4 maps its 16 possible values to match the bell-curve distribution of transformer weights, preserving more information where it matters.
  • FP4 (4-bit Floating Point) -- a standard 4-bit format with uniform distribution. Use as a fallback if NF4 causes numerical stability issues with a specific model architecture.

Both formats reduce memory by roughly 4x compared to FP16. With either enabled and LoRA selected as training mode, Datature Vi automatically configures QLoRA: base weights in 4-bit, adapter matrices in BF16.
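The intuition behind NF4's advantage can be sketched with the standard library. The level sets below are illustrative, not the exact NF4 or FP4 codebooks: the point is only that quantile-placed levels cluster where bell-curve-distributed weights cluster.

```python
from statistics import NormalDist

# Sketch of the NF4 idea: place the 16 representable values at quantiles
# of a standard normal, so levels are densest where transformer weights
# are densest (near zero). Illustrative levels, not the real codebooks.
nd = NormalDist()
nf4_like = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
fp4_like_uniform = [-1 + 2 * i / 15 for i in range(16)]  # evenly spaced

central_gap = nf4_like[8] - nf4_like[7]   # spacing near zero
edge_gap = nf4_like[1] - nf4_like[0]      # spacing in the tail

print(f"central gap: {central_gap:.3f}, edge gap: {edge_gap:.3f}")
```

Quantile-based levels are several times denser near zero than in the tails, while evenly spaced levels spend the same resolution everywhere, including ranges where few weights actually fall.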

See LoRA and Quantization for a deeper explanation of NF4 vs FP4 and when quantization affects quality.

Disabling quantization

To run standard LoRA without quantization, disable the quantization toggle in the Model node. This doubles the memory needed for base weights (FP16 instead of 4-bit) but preserves full weight precision. Only do this if you confirm through evaluation that NF4 causes measurable quality loss on your task.


How QLoRA works under the hood

Datature Vi configures LoRA internals automatically based on your selected model architecture and size. Understanding what happens behind the scenes helps you interpret training behavior and troubleshoot issues.

Adapter rank and alpha

LoRA inserts small trainable matrices (adapters) into specific model layers. The rank controls adapter capacity: how many dimensions the adapter uses to capture fine-tuning changes. Higher ranks give more capacity but use more memory. The alpha parameter scales the adapter's contribution to the final output.

Datature Vi selects rank and alpha values tuned for each architecture and model size. You do not need to configure these manually.

A rank of 8-32 works well for most VLM tasks. Lower ranks (4-8) are sufficient for simple tasks like binary classification or short-answer VQA. Higher ranks (32-64) help with complex tasks that need fine-grained domain adaptation, such as detailed medical image captioning.

The effective scaling factor is alpha / rank. A common pattern is alpha = 2x rank. This balance keeps the adapter's influence proportional to its capacity.
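The adapter math above can be sketched in a few lines of pure Python. The dimensions, weight values, and rank/alpha choices below are illustrative, not the values Datature Vi selects.

```python
# Minimal LoRA sketch on one linear layer:
#   h = W x + (alpha / rank) * B (A x)
# Only A and B would be trained; W stays frozen. Values are illustrative.
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d_out, d_in, rank, alpha = 4, 4, 2, 4      # alpha = 2x rank
scaling = alpha / rank                     # effective scaling factor: 2.0

W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1] * d_in for _ in range(rank)]    # rank x d_in (trainable)
B = [[0.1] * rank for _ in range(d_out)]   # d_out x rank (trainable)

x = [1.0, 2.0, 3.0, 4.0]
h = [w + scaling * b
     for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(h)
```

Because A is rank x d_in and B is d_out x rank, the adapter adds only rank * (d_in + d_out) trainable values per layer, which is how LoRA stays in the 0.1-1% trainable-parameter range.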

Target modules

LoRA adapters are inserted into specific transformer layers. The default targets are the attention projection layers (q_proj, k_proj, v_proj, o_proj), which control how the model relates different parts of image and text inputs. For some architectures, Datature Vi also targets feed-forward layers (gate_proj, up_proj, down_proj) when the model size and task complexity warrant it.

Learning rate scheduling

Datature Vi uses cosine annealing with warmup as the default learning rate schedule. The learning rate rises from near-zero during a warmup phase, then smoothly decays following a cosine curve. This gives the model time to make large updates early (when there is the most to learn) and small, precise updates later (when fine-tuning is finishing).

The learning rate range, warmup duration, and schedule type are set automatically based on the model architecture and training mode.
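The schedule shape is easy to see in code. The peak learning rate, warmup length, and step counts below are hypothetical numbers for illustration; Datature Vi sets the real values automatically.

```python
import math

# Cosine annealing with warmup. All values here are hypothetical
# illustrations -- Datature Vi picks the real ones per architecture.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=2e-4, min_lr=0.0):
    if step < warmup_steps:                # linear ramp up from near zero
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # smooth cosine decay from peak_lr down to min_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(f"start {lr_at(0):.2e}, peak {lr_at(99):.2e}, "
      f"midway {lr_at(550):.2e}, end {lr_at(999):.2e}")
```

The curve rises linearly to the peak by the end of warmup, sits at half the peak midway through the decay, and approaches the minimum at the final step.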


VRAM requirements

QLoRA VRAM estimates

| Model | Quantization | Estimated VRAM | Recommended GPU |
|---|---|---|---|
| Qwen 3B | NF4 | ~4 GB | T4 (16 GB) |
| Qwen 7B | NF4 | ~7 GB | T4 (16 GB) |
| Qwen 7B | FP16 (no quantization) | ~12 GB | T4 or L4 |
| InternVL 8B | FP16 (no quantization) | ~14 GB | T4 or L4 |
| Qwen 32B | NF4 | ~18 GB | L4 or A10 |
| Qwen 32B | FP16 (no quantization) | ~48 GB | A100 (80 GB) |
| Qwen 72B | NF4 | ~36 GB | A100 (80 GB) |

These estimates include model weights, adapter parameters, optimizer states, and activation memory with gradient checkpointing enabled. Actual usage varies by 10-15% depending on batch size and sequence length. Leave 20-30% headroom above the estimate when selecting your GPU.
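A quick way to apply the headroom rule is to inflate the table estimate before comparing it to a card's VRAM. The helper below is a simple sketch of that check, using the table's own numbers.

```python
# Headroom check when picking a GPU: inflate the table estimate by
# 20-30% before comparing against the card's VRAM.
def fits(estimated_gb, gpu_vram_gb, headroom=0.25):
    return estimated_gb * (1 + headroom) <= gpu_vram_gb

print(fits(7, 16))   # Qwen 7B NF4 on a T4 (16 GB): True
print(fits(18, 16))  # Qwen 32B NF4 on a T4 (16 GB): False
```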

For a detailed VRAM estimation formula and full GPU comparison, see GPU and Compute Resources Guide.


Troubleshooting QLoRA training

Out-of-memory errors

The model and batch data exceed your GPU's VRAM. Fix it in this order:

  1. Reduce batch size to 1 or 2. Use gradient accumulation to maintain effective batch size.
  2. Confirm NF4 is enabled. Without quantization, memory usage roughly doubles.
  3. Try a smaller model architecture. Dropping from 7B to 3B cuts memory significantly.
  4. Switch to a GPU with more VRAM. See the GPU tiers table.
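Step 1 above works because gradient accumulation trades per-step memory for extra forward passes. A quick sketch of the arithmetic, with an illustrative target batch size:

```python
# Keep the effective batch size constant while shrinking the per-step
# (micro) batch. Target batch of 16 is an illustrative example.
def accumulation_steps(target_effective_batch, micro_batch):
    # ceil division so the effective batch never drops below the target
    return -(-target_effective_batch // micro_batch)

target = 16
for micro in (8, 4, 2, 1):
    steps = accumulation_steps(target, micro)
    print(f"micro batch {micro} x {steps} accumulation steps "
          f"= effective batch {micro * steps}")
```

Gradients are summed across the accumulation steps before each optimizer update, so training dynamics stay close to the larger batch while peak activation memory scales with the micro batch.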

Loss spikes

Sudden jumps in training loss usually indicate instability in the learning process. Try:

  1. Lower the learning rate. Reduce it by half in Model Settings.
  2. Reduce batch size. Smaller batches can stabilize gradient updates.
  3. Check your annotations. Inconsistent labels send conflicting signals that destabilize training.

Quality degradation

NF4 quantization causes minor quality loss (typically under 0.5%). If you see a meaningful drop:

  1. Try FP4 instead of NF4. Some model architectures respond differently to each format.
  2. Disable quantization and run LoRA in FP16. This uses more memory but preserves full weight precision.
  3. Add more training data. More diverse examples help the model learn robust patterns despite lower weight precision.
  4. Try a larger model. Larger models have more redundancy and absorb quantization noise better.

Slow convergence

If loss decreases very gradually over many epochs:

  1. Increase the learning rate. Try doubling it in Model Settings.
  2. Add more epochs. The model may need more passes through your data.
  3. Check your annotations. Inconsistent or low-quality annotations slow convergence because the model receives conflicting signals.

Frequently asked questions

What is the difference between LoRA and QLoRA?

LoRA inserts small trainable adapter matrices into a frozen base model. QLoRA does the same thing but also quantizes the frozen base weights to NF4 (4-bit), cutting memory by roughly 4x. In Datature Vi, enabling LoRA with NF4 quantization automatically uses QLoRA. There is no separate toggle.

Can I configure LoRA rank, alpha, or target modules myself?

Not directly. Datature Vi selects rank, alpha, and target modules automatically based on your model architecture and size. These defaults are tuned for each supported architecture. If you need more control, the Vi SDK allows programmatic configuration of training flows.

Which model architectures support QLoRA?

QLoRA works with all architectures in Datature Vi except NVILA-Lite, which only supports full fine-tuning. See Model Architectures for each architecture's supported training modes.

Should I keep NF4 quantization enabled?

For most tasks, yes. NF4 provides significant memory savings with minimal quality impact. Disable NF4 only if you confirm through evaluation that FP16 LoRA produces measurably better results on your specific task and you have the GPU resources for it.

Should I choose NF4 or FP4?

Start with NF4. It is designed for transformer weight distributions and preserves more information than FP4 for VLMs. Switch to FP4 only if you encounter numerical stability issues (loss spikes, NaN values) that do not resolve with other troubleshooting steps.


Next steps

Full SFT Training Guide

When and how to use full fine-tuning for maximum accuracy.

Start a Training Run

Select GPU hardware and launch your configured workflow.

Training Metrics

Read loss curves, F1, IoU, BLEU, and BERTScore on the run dashboard.