How Do LoRA and Quantization Work?

Learn how LoRA reduces training cost and how quantization shrinks model memory. Understand NF4, FP4, BFloat16, and when to use each in Datature Vi.

LoRA and quantization are two techniques that reduce the cost of fine-tuning a vision-language model. LoRA cuts training time and memory by updating only a small fraction of the model's parameters. Quantization reduces memory further by storing model weights in lower precision. Together (a combination called QLoRA), they let you fine-tune large models on GPUs that could not otherwise fit them.

This page explains how each technique works, when to use them, and what tradeoffs they introduce.


Why full fine-tuning is expensive

A VLM stores what it knows in parameters: numerical weights organized in large matrices. A 7B model has 7 billion of these weights. During full fine-tuning, the training process updates every single weight at every step. This requires:

  • GPU memory to hold all 7 billion weights, their gradients, and the optimizer states (which track how each weight has been changing). For a 7B model in Float16, that is roughly 14 GB for weights alone, plus 2-3x more for gradients and optimizer states.
  • Training time proportional to the number of weights being updated.
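The memory arithmetic above can be sketched in a few lines. This is a rough estimate under stated assumptions, not a measurement: it assumes every value is stored in 16 bits and that the optimizer keeps two extra values per weight (as Adam-style optimizers do). The function name and parameters are made up for illustration, and activations and framework overhead are ignored.

```python
def full_finetune_state_gb(num_params, bytes_per_value=2, optimizer_copies=2):
    """Estimate memory (decimal GB) for weights, gradients, and optimizer state.

    Assumptions: all values use `bytes_per_value` bytes, and the optimizer
    stores `optimizer_copies` extra values per weight. Activations are excluded.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value  # one gradient per weight
    optimizer = num_params * bytes_per_value * optimizer_copies
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": gradients / 1e9,
        "optimizer_gb": optimizer / 1e9,
        "total_gb": (weights + gradients + optimizer) / 1e9,
    }


budget = full_finetune_state_gb(7_000_000_000)
for name, value in budget.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the weights take 14 GB and the full training state about 56 GB. Real setups often keep optimizer states in 32 bits, which pushes the total higher.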

For many tasks, updating all 7 billion weights is unnecessary. The base model already knows how to process images and generate text. You only need it to learn your specific domain and task. LoRA exploits this by updating a much smaller set of parameters.


How LoRA works

LoRA (Low-Rank Adaptation) freezes all original model weights and inserts small, trainable matrices alongside them. Only these new matrices are updated during training.

The core idea

Each layer in a VLM has large weight matrices. During full fine-tuning, these entire matrices change. LoRA replaces each full-matrix update with two much smaller matrices that, when multiplied together, approximate the same change.

1. Freeze the original weights

The pre-trained model's weights are locked. They do not change during training. This means you do not need to store gradients or optimizer states for billions of parameters.

2. Add small adapter matrices

LoRA inserts two small matrices (called A and B) alongside each target layer. If the original weight matrix is 4096x4096, the LoRA matrices might be 4096x16 and 16x4096. The number 16 is the rank, and it controls how much capacity the adapter has.

3. Train only the adapters

During training, only the A and B matrices receive gradient updates. The total number of trainable parameters drops from billions to millions, typically a 95-99% reduction.

4. Combine at inference

At inference time, the adapter matrices are multiplied together and added to the original weights. The result behaves like a fully fine-tuned model, with no extra latency.
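The four steps above can be sketched with toy matrices. This is a minimal illustration, not Datature Vi's implementation. It assumes the common convention that B is the tall matrix (d x r) and A the wide one (r x d), uses random values for both adapters so the result is non-trivial, and leaves out the scaling factor that LoRA implementations usually apply to the adapter output.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


random.seed(0)
d, r = 6, 2  # toy sizes; the example in the text uses d=4096 and rank 16

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen base weights
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # d x r adapter
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # r x d adapter
x = [random.gauss(0, 1) for _ in range(d)]                        # one input vector

# Training-time view: the frozen path plus the low-rank adapter path.
adapter_out = matvec(B, matvec(A, x))
y_adapter = [base + extra for base, extra in zip(matvec(W, x), adapter_out)]

# Inference-time view: fold the adapter into the weights once, then use one matrix.
delta = matmul(B, A)  # d x d update built from the two small matrices
W_merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(f"largest difference between the two views: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why merging the adapter adds no inference latency: after the merge, the layer is a single matrix of the original size.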

Research has shown that the weight changes needed for fine-tuning tend to be low-rank: they can be represented by a small number of dimensions rather than the full matrix. Think of it this way: the base model already understands general vision and language. Fine-tuning for your specific task (detecting defects, answering questions about medical images) requires adjusting the model's behavior in a relatively narrow direction. LoRA captures that narrow adjustment without changing everything else.

The rank parameter controls how many dimensions LoRA uses. A rank of 8-32 works well for most tasks. Higher ranks give more capacity but use more memory and train more slowly.
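To see how rank drives adapter size, here is a short worked calculation for a single 4096x4096 weight matrix. The helper name is hypothetical. The count assumes one pair of adapter matrices per weight matrix, with shapes d_out x rank and rank x d_in.

```python
def lora_params(d_in, d_out, rank):
    """Trainable values in one adapter pair: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_in + d_out)


full = 4096 * 4096  # values in the original weight matrix
for rank in (8, 16, 32):
    n = lora_params(4096, 4096, rank)
    print(f"rank {rank}: {n} adapter values, {100 * n / full:.2f}% of the full matrix")
# rank 8: 65536 adapter values, 0.39% of the full matrix
# rank 16: 131072 adapter values, 0.78% of the full matrix
# rank 32: 262144 adapter values, 1.56% of the full matrix
```

Adapter size grows linearly with rank, while the full matrix grows with the product of its dimensions. The whole-model percentage can be lower than the per-matrix figure, presumably because adapters are attached only to selected layers.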

What LoRA changes in practice

| Metric | Full fine-tuning | LoRA (rank 16) |
| --- | --- | --- |
| Trainable parameters (7B model) | ~7 billion | ~20-50 million (0.3-0.7%) |
| GPU memory (7B model) | ~56 GB+ (weights + optimizer) | ~16-20 GB |
| Training speed | Baseline | 2-3x faster |
| Final accuracy | Highest possible | Within 1-3% for most tasks |
| Inference speed | Baseline | Same (adapters merge into weights) |

How quantization works

Quantization reduces the precision of model weights from 16-bit or 32-bit numbers to 4-bit numbers. This shrinks weight memory by approximately 4x relative to 16-bit storage (8x relative to 32-bit).

Why precision matters

Numbers in a computer are stored with a fixed number of bits. More bits mean more precision:

  • Float32 (32 bits): High precision, large memory footprint. Each weight takes 4 bytes.
  • Float16 / BFloat16 (16 bits): Half the memory of Float32. BFloat16 has a wider range than Float16 and is more stable for training on modern GPUs.
  • NF4 / FP4 (4 bits): Quarter the memory of Float16. Each weight takes 0.5 bytes.

A 7B model in Float16 requires ~14 GB of GPU memory for weights alone. The same model in NF4 requires ~3.5 GB.
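The range-versus-precision tradeoff between Float16 and BFloat16 can be checked with the standard library alone. This is a sketch under one simplifying assumption: BFloat16 is emulated by keeping the top 16 bits of the Float32 encoding, which truncates, whereas hardware typically rounds to nearest. The helper names are made up for illustration.

```python
import struct


def to_float16(x):
    """Round-trip through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x):
    """Emulate BFloat16 by keeping the top 16 bits of the Float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision near 1.0: Float16 has more mantissa bits, so it stays closer to 1.001.
print(to_float16(1.001), to_bfloat16(1.001))

# Range: 70000 is above Float16's largest finite value (65504) but fits in BFloat16.
try:
    to_float16(70000.0)
    float16_overflowed = False
except OverflowError:
    float16_overflowed = True
print(float16_overflowed, to_bfloat16(70000.0))
```

Float16 keeps 1.001 as 1.0009765625 while the emulated BFloat16 collapses it to 1.0. For 70000 the situation reverses: Float16 overflows and BFloat16 stores a nearby value. Large intermediate values are common in training, which is one reason the wider range tends to matter more there than the extra mantissa bits.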

NF4 vs FP4

Datature Vi supports two 4-bit formats:

| Format | Precision | Best for | Memory savings |
| --- | --- | --- | --- |
| NF4 (4-bit NormalFloat) | Optimized distribution for transformer weights | VLM training (recommended default) | ~4x vs Float16 |
| FP4 (4-bit Floating Point) | Standard 4-bit distribution | Fallback if NF4 causes issues | ~4x vs Float16 |

NF4 is the recommended default. It was designed for transformer model weights, which follow a bell-curve distribution. NF4 places its 16 possible values (4 bits = 2^4 = 16 levels) to match this distribution, so more levels sit near zero, where most weights are. FP4 is a generic 4-bit floating-point format whose levels are not fitted to that distribution. That makes it less efficient for transformer weights, but it works as a fallback.
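The idea of fitting 16 levels to a bell curve can be illustrated with a simplified codebook. This is a sketch of the principle, not the actual NF4 codebook: it places levels at evenly spaced quantiles of a standard normal distribution and scales them to [-1, 1], whereas the published NF4 values differ (for example, they include an exact zero). The evenly spaced grid is only a baseline for comparison and is not meant to reproduce FP4. All function names are hypothetical.

```python
from statistics import NormalDist


def quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of a standard normal, scaled to [-1, 1]."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def evenly_spaced_levels(n_levels=16):
    """Baseline grid: the same number of levels, spread evenly over [-1, 1]."""
    return [-1 + 2 * i / (n_levels - 1) for i in range(n_levels)]


def quantize_block(weights, levels):
    """Scale a block by its largest magnitude, then store the nearest level's index."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Recover approximate weights from 4-bit codes and the per-block scale."""
    return [levels[c] * scale for c in codes]


bell = quantile_levels()
even = evenly_spaced_levels()

# Count how many of the 16 levels fall in the central half of the range.
levels_near_zero_bell = sum(1 for v in bell if abs(v) < 0.5)
levels_near_zero_even = sum(1 for v in even if abs(v) < 0.5)
print(levels_near_zero_bell, levels_near_zero_even)

block = [0.02, -0.31, 0.07, 0.9, -0.05, 0.12]  # made-up weights for illustration
codes, scale = quantize_block(block, bell)
restored = dequantize_block(codes, scale, bell)
print(codes)
print([round(v, 3) for v in restored])
```

The quantile-based codebook puts 10 of its 16 levels in the central half of the range, against 8 for the even grid, so small weights are resolved more finely. Each weight is stored as a code from 0 to 15 plus one shared scale per block, which is where the memory savings come from.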

Precision types for training

Quantization applies to how weights are stored. Precision type applies to how calculations happen during training:

  • BFloat16 (recommended): Best stability on modern GPUs (NVIDIA Ampere and newer). Runs at roughly 2x the speed of Float32 with minimal accuracy loss.
  • Float16: Good for older GPUs without BFloat16 support. Slightly more prone to gradient instability with very large models.
  • Float32: Highest precision, slowest, highest memory usage. Use only for debugging numerical issues.

Quantization compresses information, so some loss is inevitable. In practice, the quality loss from NF4 quantization is small for most VLM tasks. You are more likely to see degradation when: (1) the task requires fine-grained numerical reasoning, (2) the model is already small (0.8B-2B) and has less redundancy to absorb the compression, or (3) you are pushing accuracy to its absolute maximum and even 1% matters.

For most users, the 4x memory savings far outweigh the minor quality reduction. Start with NF4 enabled and disable it only if you see quality issues you cannot solve through other means (more data, more epochs, different learning rate).


LoRA + quantization (QLoRA)

QLoRA combines both techniques. The base model weights are stored in 4-bit precision (quantized), and LoRA adapter matrices are trained in higher precision (BFloat16). This gives you the memory savings of both techniques simultaneously.

In Datature Vi, enabling LoRA training mode with NF4 quantization automatically uses QLoRA. There is no separate setting.

Memory comparison for a 7B model

| Configuration | Approximate GPU memory |
| --- | --- |
| Full fine-tuning, Float16 | 56+ GB |
| Full fine-tuning, NF4 quantized | 28+ GB |
| LoRA, Float16 | 16-20 GB |
| QLoRA (LoRA + NF4) | 8-12 GB |

QLoRA makes it possible to fine-tune a 7B model on a single GPU with 16 GB of VRAM, a job that would require 56+ GB with full fine-tuning in Float16.
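A rough breakdown shows where the QLoRA savings come from. This sketch counts only static state: the frozen 4-bit base weights and the training state of the adapters. The adapter count of 40 million is taken from the 20-50 million range quoted earlier on this page. The 16-bit adapter storage and the two optimizer values per adapter weight are assumptions, and the function name is hypothetical.

```python
def qlora_static_gb(base_params, adapter_params, base_bits=4,
                    adapter_bytes=2, optimizer_copies=2):
    """Estimate static memory (decimal GB) for a frozen 4-bit base plus trainable adapters."""
    # The frozen base needs no gradients or optimizer state.
    base = base_params * base_bits / 8
    # Each adapter value needs its weight, its gradient, and the optimizer copies.
    adapter_state = adapter_params * adapter_bytes * (2 + optimizer_copies)
    return {
        "base_gb": base / 1e9,
        "adapter_gb": adapter_state / 1e9,
        "total_gb": (base + adapter_state) / 1e9,
    }


estimate = qlora_static_gb(7_000_000_000, 40_000_000)
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the static state is under 4 GB, with the adapters contributing only about 0.3 GB. Measured usage is higher because activations, quantization metadata, and framework overhead are not counted here.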


Choosing your settings

1. Start with LoRA + NF4 (QLoRA)

This is the default in Datature Vi and works well for most tasks. It uses the least memory and trains the fastest.

2. Check your results

Train for a reasonable number of epochs and evaluate. If accuracy meets your target, you are done. Most users never need to change this.

3. Scale up if needed

If accuracy is insufficient, try these in order: (1) add more training data, (2) increase the number of epochs, (3) try a larger model architecture, (4) switch from LoRA to full fine-tuning.

NVILA-Lite exception

NVILA-Lite only supports full fine-tuning. LoRA is not available for this architecture. See Model Architectures for details on each architecture's supported training modes.


Frequently asked questions

Does LoRA slow down inference?

No. After training, LoRA adapter weights are merged into the base model weights. The merged model runs at the same speed as a fully fine-tuned model. There is no additional latency at inference time.

Can I train more than one LoRA adapter for the same model?

Datature Vi trains one LoRA adapter per training run. Each run produces a model with the adapter already merged. If you need to adapt a model for a second task, start a new training run with a new dataset.

Does NF4 quantization reduce output quality?

NF4 quantization typically has minimal impact on output quality for VLMs. The compression is well-suited to transformer weight distributions. You are more likely to see quality differences with very small models (under 2B parameters) or tasks requiring extreme precision. Try NF4 first and compare your evaluation metrics with and without it.

What is the difference between quantization and precision type?

Quantization controls how model weights are stored in memory (4-bit vs 16-bit). Precision type controls how math operations are performed during training (BFloat16, Float16, or Float32). You can quantize weights to NF4 while running training calculations in BFloat16. This combination is the recommended default.

When should I switch from LoRA to full fine-tuning?

Switch to full fine-tuning when: (1) LoRA accuracy plateaus below your target after trying more data and more epochs, (2) your images are very different from typical photos (microscopy, satellite, X-ray, spectrograms), or (3) you need absolute maximum accuracy and have the GPU resources. For most tasks, LoRA results are within 1-3% of full fine-tuning.


Related resources