How Do LoRA and Quantization Work?

LoRA and quantization are two techniques that reduce the cost of fine-tuning a vision-language model. LoRA cuts training time and memory by updating only a small fraction of the model's parameters. Quantization reduces memory further by storing model weights in lower precision. Together (a combination called QLoRA), they let you fine-tune large models on GPUs that could not otherwise fit them.

This page explains how each technique works, when to use them, and what tradeoffs they introduce.

Why full fine-tuning is expensive

A VLM stores what it knows in parameters: numerical weights organized in large matrices. A 7B model has 7 billion of these weights. During full fine-tuning, the training process updates every single weight at every step. This requires:

GPU memory to hold all 7 billion weights, their gradients, and the optimizer states (which track how each weight has been changing). For a 7B model in Float16, that is roughly 14 GB for weights alone, plus 2-3x more for gradients and optimizer states.
Training time proportional to the number of weights being updated.

For many tasks, updating all 7 billion weights is unnecessary. The base model already knows how to process images and generate text. You only need it to learn your specific domain and task. LoRA exploits this by updating a much smaller set of parameters.

How LoRA works

LoRA (Low-Rank Adaptation) freezes all original model weights and inserts small, trainable matrices alongside them. Only these new matrices are updated during training.

The core idea

Each layer in a VLM has large weight matrices. During full fine-tuning, these entire matrices change. LoRA replaces each full-matrix update with two much smaller matrices that, when multiplied together, approximate the same change.
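To make the idea concrete, here is a minimal pure-Python sketch of a low-rank update. The names `d`, `k`, `r`, `A`, and `B`, and the 4096 x 4096 layer size, are illustrative assumptions rather than values taken from any particular architecture.

```python
# Illustrative sketch of a low-rank update; shapes and names are assumptions.

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k          # full fine-tuning updates every entry of W
    lora = r * (d + k)    # LoRA trains B (d x r) and A (r x k) instead
    return full, lora

# Tiny example: a frozen 4 x 4 weight matrix plus a rank-1 update.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [0.0], [0.0]]   # d x r, trainable
A = [[0.0, 2.0, 0.0, 0.0]]         # r x k, trainable
delta = matmul(B, A)               # d x k matrix of rank at most r
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]

full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

Under these assumed shapes, a rank-16 adapter trains under 1% of the values that a full update of the same matrix would touch, while `W` itself never changes.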

What LoRA changes in practice

Metric | Full fine-tuning | LoRA (rank 16)

How quantization works

Quantization reduces the precision of stored model weights from 16-bit or 32-bit numbers to 4-bit numbers. This shrinks weight memory by approximately 4x relative to 16-bit storage (8x relative to 32-bit).

Why precision matters

Numbers in a computer are stored with a fixed number of bits. More bits mean more precision:

Float32 (32 bits): High precision, large memory footprint. Each weight takes 4 bytes.
Float16 / BFloat16 (16 bits): Half the memory of Float32. BFloat16 has a wider range than Float16 and is more stable for training on modern GPUs.
NF4 / FP4 (4 bits): Quarter the memory of Float16. Each weight takes 0.5 bytes.

A 7B model in Float16 requires ~14 GB of GPU memory for weights alone. The same model in NF4 requires ~3.5 GB.
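The figures above follow from multiplying the parameter count by the bytes per weight. A short sketch of that arithmetic, taking 1 GB as 10^9 bytes (the function and dictionary names are illustrative):

```python
# Weight-only memory estimate; 1 GB is taken as 10**9 bytes.
BYTES_PER_WEIGHT = {
    "Float32": 4.0,
    "Float16": 2.0,
    "BFloat16": 2.0,
    "NF4": 0.5,
    "FP4": 0.5,
}

def weight_memory_gb(num_params, fmt):
    """Memory needed to store the weights alone, in GB."""
    return num_params * BYTES_PER_WEIGHT[fmt] / 1e9

for fmt in ("Float32", "Float16", "NF4"):
    print(f"{fmt}: {weight_memory_gb(7e9, fmt):.1f} GB")
# Float32: 28.0 GB
# Float16: 14.0 GB
# NF4: 3.5 GB
```

This counts raw weight storage only. Real 4-bit formats also keep a small amount of per-block scaling data, so actual usage is slightly higher.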

NF4 vs FP4

Datature Vi supports two 4-bit formats:

Format | Precision | Best for | Memory savings

NF4 is the recommended default. It was designed for transformer model weights, which roughly follow a bell-curve (normal) distribution. NF4 places its 16 possible values (4 bits = 2^4 = 16 levels) to match this distribution, so more levels sit near zero, where most weights fall. FP4 is a generic 4-bit floating-point format whose levels are not fitted to that distribution, which makes it less efficient for transformer weights, but it works as a fallback.
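The idea of fitting quantization levels to a bell curve can be sketched in a few lines. This is a simplified illustration, not the actual NF4 codebook: it assumes levels placed at evenly spaced quantiles of a standard normal distribution and a per-block scale equal to the block's largest absolute value. All function names are hypothetical.

```python
from statistics import NormalDist
import random

def normal_quantile_levels(bits=4):
    """Place 2**bits levels at evenly spaced normal quantiles, scaled to [-1, 1]."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

def quantize_block(weights, levels):
    """Map each weight to the index of its nearest level after absmax scaling."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [levels[c] * absmax for c in codes]

levels = normal_quantile_levels()
rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(64)]  # synthetic bell-curve weights
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

inner_gap = levels[8] - levels[7]    # spacing between the two levels nearest zero
outer_gap = levels[15] - levels[14]  # spacing at the edge of the range
print(f"inner gap {inner_gap:.3f}, outer gap {outer_gap:.3f}")
```

In this sketch the levels are packed more tightly around zero than at the edges, which is the property the paragraph above describes: finer resolution where most weights lie.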

Precision types for training

Quantization applies to how weights are stored. Precision type applies to how calculations happen during training:

BFloat16 (recommended): Best stability on modern GPUs (NVIDIA Ampere and newer). Runs at roughly 2x the speed of Float32 with minimal accuracy loss.
Float16: Good for older GPUs without BFloat16 support. Slightly more prone to gradient instability with very large models.
Float32: Highest precision, slowest, highest memory usage. Use only for debugging numerical issues.
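The range-versus-precision tradeoff between Float16 and BFloat16 can be seen with the standard library alone. The sketch below emulates BFloat16 by truncating a Float32 encoding to its top 16 bits; real hardware typically rounds rather than truncates, so treat this as an approximation. The helper names are illustrative.

```python
import struct

def to_float16(x):
    """Round-trip x through IEEE half precision; return None if it overflows."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return None

def to_bfloat16(x):
    """Emulate BFloat16 by keeping only the top 16 bits of the Float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Range: Float16 cannot represent values much above 65504; BFloat16 can.
print(to_float16(70000.0))   # None (overflow)
print(to_bfloat16(70000.0))  # 69632.0

# Precision: Float16 keeps more mantissa bits, so small values are more exact.
print(to_float16(1 / 3))     # 0.333251953125
print(to_bfloat16(1 / 3))    # 0.33203125
```

Float16's narrow range is why large intermediate values can overflow during training, while BFloat16 trades some precision for the same exponent range as Float32.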

LoRA + quantization (QLoRA)

QLoRA combines both techniques. The base model weights are stored in 4-bit precision (quantized), and LoRA adapter matrices are trained in higher precision (BFloat16). This gives you the memory savings of both techniques simultaneously.

In Datature Vi, enabling LoRA training mode with NF4 quantization automatically uses QLoRA. There is no separate setting.

Memory comparison for a 7B model

Configuration | Approximate GPU memory

QLoRA makes it possible to fine-tune a 7B model on a single GPU with 16 GB of VRAM, a configuration that would otherwise require 56+ GB.
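A back-of-the-envelope estimate shows where figures of this size can come from. The sketch below rests on several assumptions that are not stated on this page: an Adam-style optimizer that keeps two extra values per trainable weight, 16-bit storage for weights, gradients, and optimizer states, and a hypothetical adapter size of 40 million parameters. It ignores activations and framework overhead, which add to the real total.

```python
def full_finetune_gb(num_params, bytes_per_value=2):
    """Weights + gradients + two optimizer states, all at the same width (assumption)."""
    return num_params * bytes_per_value * 4 / 1e9

def qlora_gb(num_params, adapter_params, base_bytes=0.5, adapter_bytes=2):
    """Frozen 4-bit base weights plus fully trained higher-precision adapters."""
    base = num_params * base_bytes                 # no gradients or optimizer states
    adapters = adapter_params * adapter_bytes * 4  # weights + gradients + 2 states
    return (base + adapters) / 1e9

print(full_finetune_gb(7e9))   # 56.0
print(qlora_gb(7e9, 40e6))     # 3.82 (40e6 adapter parameters is a hypothetical figure)
```

Because gradients and optimizer states are only kept for the adapters, most of the remaining budget on a 16 GB GPU is left for activations and batch data.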

Choosing your settings

NVILA-Lite exception

NVILA-Lite only supports full fine-tuning. LoRA is not available for this architecture. See Model Architectures for details on each architecture's supported training modes.