How Do LoRA and Quantization Work?
Learn how LoRA reduces training cost and how quantization shrinks model memory. Understand NF4, FP4, BFloat16, and when to use each in Datature Vi.
LoRA and quantization are two techniques that reduce the cost of fine-tuning a vision-language model. LoRA cuts training time and memory by updating only a small fraction of the model's parameters. Quantization reduces memory further by storing model weights in lower precision. Together (a combination called QLoRA), they let you fine-tune large models on GPUs that could not otherwise fit them.
This page explains how each technique works, when to use them, and what tradeoffs they introduce.
Why full fine-tuning is expensive
A VLM stores what it knows in parameters: numerical weights organized in large matrices. A 7B model has 7 billion of these weights. During full fine-tuning, the training process updates every single weight at every step. This requires:
- GPU memory to hold all 7 billion weights, their gradients, and the optimizer states (which track how each weight has been changing). For a 7B model in Float16, that is roughly 14 GB for the weights alone (2 bytes per weight), plus 2-3x that amount for optimizer states.
- Training time proportional to the number of weights being updated.
For many tasks, updating all 7 billion weights is unnecessary. The base model already knows how to process images and generate text. You only need it to learn your specific domain and task. LoRA exploits this by updating a much smaller set of parameters.
How LoRA works
LoRA (Low-Rank Adaptation) freezes all original model weights and inserts small, trainable matrices alongside them. Only these new matrices are updated during training.
The core idea
Each layer in a VLM has large weight matrices. During full fine-tuning, these entire matrices change. LoRA replaces each full-matrix update with two much smaller matrices that, when multiplied together, approximate the same change.
Freeze the original weights
The pre-trained model's weights are locked. They do not change during training. This means you do not need to store gradients or optimizer states for billions of parameters.
Add small adapter matrices
LoRA inserts two small matrices (called A and B) alongside each target layer. If the original weight matrix is 4096x4096, the LoRA matrices might be 4096x16 and 16x4096. The number 16 is the rank, and it controls how much capacity the adapter has.
Train only the adapters
During training, only the A and B matrices receive gradient updates. The total number of trainable parameters drops from billions to millions, typically a 95-99% reduction.
Combine at inference
At inference time, the adapter matrices are multiplied together and added to the original weights. The result behaves like a fully fine-tuned model, with no extra latency.
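The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not Datature Vi's implementation: the tiny 8x8 layer, the random values, and the helper names (`matmul`, `add`, `lora_trainable_params`) are assumptions made for the example. Production LoRA implementations also apply a scaling factor and typically start one adapter matrix at zero; both are omitted here.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, r = 8, 2                      # toy layer width and LoRA rank (hypothetical sizes)
W = rand_matrix(d, d, rng)       # frozen pre-trained weights
A = rand_matrix(d, r, rng)       # trainable adapter, d x r
B = rand_matrix(r, d, rng)       # trainable adapter, r x d
x = rand_matrix(1, d, rng)       # one input row

# Training-time view: frozen path plus the low-rank adapter path.
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B))

# Inference-time view: fold A @ B into W once, then do a single matmul.
W_merged = add(W, matmul(A, B))
y_merged = matmul(x, W_merged)

max_diff = max(abs(a - b) for a, b in zip(y_adapter[0], y_merged[0]))
print(f"largest difference between the two paths: {max_diff:.2e}")

def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one A (d_in x rank) plus one B (rank x d_out)."""
    return rank * (d_in + d_out)

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 16)
print(f"full matrix: {full:,} weights, rank-16 adapter: {lora:,} ({lora / full:.2%})")
```

The two paths agree up to floating-point rounding, which is why a merged adapter adds no inference latency. For the 4096x4096 example, the rank-16 adapter holds 131,072 trainable values against 16,777,216 in the full matrix.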
How quantization works
Quantization reduces the precision of stored model weights from 16-bit or 32-bit numbers to 4-bit numbers. Compared with 16-bit weights, this shrinks weight memory by approximately 4x (approximately 8x compared with 32-bit weights).
Why precision matters
Numbers in a computer are stored with a fixed number of bits. More bits mean more precision:
- Float32 (32 bits): High precision, large memory footprint. Each weight takes 4 bytes.
- Float16 / BFloat16 (16 bits): Half the memory of Float32. BFloat16 has a wider range than Float16 and is more stable for training on modern GPUs.
- NF4 / FP4 (4 bits): Quarter the memory of Float16. Each weight takes 0.5 bytes.
A 7B model in Float16 requires ~14 GB of GPU memory for weights alone. The same model in NF4 requires ~3.5 GB.
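The figures above follow directly from bits per weight. The short calculation below reproduces them; the function name `weight_memory_gb` is made up for this example, 1 GB is taken as 10^9 bytes, and the small per-block scaling constants that 4-bit formats also store are ignored.

```python
# Bits used to store one weight in each format listed above.
BITS_PER_WEIGHT = {"Float32": 32, "Float16": 16, "BFloat16": 16, "NF4": 4, "FP4": 4}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>8}: {weight_memory_gb(7e9, fmt):5.1f} GB")
```

This counts weights only. Gradients, optimizer states, and activations come on top during training.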
NF4 vs FP4
Datature Vi supports two 4-bit formats:
NF4 is the recommended default. It was designed for transformer model weights, which follow a bell-curve distribution. NF4 places its 16 possible values (4 bits = 2^4 = 16 levels) to match this distribution, so more levels sit near zero, where most weights are. FP4 is a generic 4-bit floating-point format whose levels are not fitted to that distribution. It is less efficient for transformer weights but works as a fallback.
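To make the idea of distribution-matched levels concrete, the sketch below builds 16 levels from the quantiles of a standard normal distribution and compares their spacing with an evenly spaced grid. It illustrates the principle only: these are not the actual NF4 code values, the evenly spaced grid is a stand-in for comparison rather than FP4, and the function names and the absmax block scaling are assumptions of this example.

```python
from statistics import NormalDist

def quantile_levels(n=16):
    """n levels at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

def even_levels(n=16):
    """n evenly spaced levels in [-1, 1], for comparison."""
    return [-1 + 2 * i / (n - 1) for i in range(n)]

def quantize(block, levels):
    """Scale a block by its largest magnitude, then store the nearest level index."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(levels)), key=lambda k: abs(w / absmax - levels[k]))
             for w in block]
    return codes, absmax

def dequantize(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [levels[c] * absmax for c in codes]

levels = quantile_levels()
center_gap = levels[8] - levels[7]   # spacing between the two middle levels
edge_gap = levels[1] - levels[0]     # spacing between the two outermost levels
print(f"quantile levels: center gap {center_gap:.3f}, edge gap {edge_gap:.3f}")
print(f"even levels:     every gap  {even_levels()[1] - even_levels()[0]:.3f}")

block = [0.02, -0.15, 0.4, -1.2, 0.07, 0.0, 0.9, -0.3]
codes, absmax = quantize(block, levels)
restored = dequantize(codes, absmax, levels)
print(codes)
print([round(v, 3) for v in restored])
```

The quantile-based levels are tightly packed around zero and sparse in the tails, while the even grid spends the same resolution everywhere. Each stored code is an integer from 0 to 15, which fits in 4 bits.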
Precision types for training
Quantization applies to how weights are stored. Precision type applies to how calculations happen during training:
- BFloat16 (recommended): Best stability on modern GPUs (NVIDIA Ampere and newer). Runs at roughly 2x the speed of Float32 with minimal accuracy loss.
- Float16: Good for older GPUs without BFloat16 support. Slightly more prone to gradient instability with very large models.
- Float32: Highest precision, slowest, highest memory usage. Use only for debugging numerical issues.
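The range difference between Float16 and BFloat16 can be checked with the standard library. The sketch below is an approximation under stated assumptions: `to_bfloat16` simulates BFloat16 by keeping the top 16 bits of a Float32 encoding (real hardware rounds rather than truncates), and `to_float16` reports values beyond the Float16 range as infinity. Both helper names are made up for this example.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate BFloat16 by truncating a Float32 encoding to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x: float) -> float:
    """Round-trip through IEEE Float16; values beyond its range become inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")

# Range: a value just past the Float16 maximum of 65504.
big = 70000.0
print("Float16 :", to_float16(big))
print("BFloat16:", to_bfloat16(big))

# Precision: a value just above 1.0.
small_step = 1.001
print("Float16 :", to_float16(small_step))
print("BFloat16:", to_bfloat16(small_step))
```

BFloat16 keeps the value that overflows Float16, at the cost of fewer digits of precision near 1.0. Large intermediate values such as gradients are therefore less likely to overflow in BFloat16, which is consistent with its stability advantage for training.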
LoRA + quantization (QLoRA)
QLoRA combines both techniques. The base model weights are stored in 4-bit precision (quantized), and LoRA adapter matrices are trained in higher precision (BFloat16). This gives you the memory savings of both techniques simultaneously.
In Datature Vi, enabling LoRA training mode with NF4 quantization automatically uses QLoRA. There is no separate setting.
QLoRA makes it possible to fine-tune a 7B model on a single GPU with 16 GB of VRAM, a configuration that would otherwise require 56+ GB.
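A rough budget shows where the gap comes from. The sketch below applies this page's rules of thumb (Float16 weights plus up to 3x for optimizer states) to full fine-tuning, and a 4-bit base plus BFloat16 adapters to QLoRA. The number of adapted matrices (128), their 4096x4096 shape, the rank of 16, and the 3x multiple for adapter training state are hypothetical values chosen for illustration. Activation memory is left out of both estimates.

```python
GB = 1e9  # bytes

def full_finetune_gb(n_params, bytes_per_weight=2, optimizer_multiple=3):
    """Float16 weights plus optimizer states at a multiple of the weight size."""
    weights = n_params * bytes_per_weight
    return (weights + optimizer_multiple * weights) / GB

def qlora_gb(n_params, n_adapted_matrices, d=4096, rank=16, state_multiple=3):
    """4-bit frozen base plus BFloat16 adapters and their training state."""
    base = n_params * 0.5                                  # 0.5 bytes per 4-bit weight
    adapter_params = n_adapted_matrices * rank * (d + d)   # A (d x r) + B (r x d) each
    adapters = adapter_params * 2                          # 2 bytes per BFloat16 value
    return (base + adapters + state_multiple * adapters) / GB

full = full_finetune_gb(7e9)
qlora = qlora_gb(7e9, n_adapted_matrices=128)
print(f"full fine-tuning: ~{full:.0f} GB before activations")
print(f"QLoRA:            ~{qlora:.2f} GB before activations")
```

Under these assumptions the adapters and their training state add only about 0.13 GB on top of the 3.5 GB quantized base, which leaves the rest of a 16 GB GPU for activations and working memory.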
Choosing your settings
Start with LoRA + NF4 (QLoRA)
This is the default in Datature Vi and works well for most tasks. It uses the least memory and trains the fastest.
Check your results
Train for a reasonable number of epochs and evaluate. If accuracy meets your target, you are done. Most users never need to change this.
Scale up if needed
If accuracy is insufficient, try these in order: (1) add more training data, (2) increase the number of epochs, (3) try a larger model architecture, (4) switch from LoRA to full fine-tuning.
NVILA-Lite only supports full fine-tuning. LoRA is not available for this architecture. See Model Architectures for details on each architecture's supported training modes.
Related resources
Model Settings
Configure training mode, quantization, and hyperparameters.
Model Architectures
Compare available VLM architectures and their supported training modes.
How Does VLM Training Work?
Epochs, batch size, learning rate, and loss curves.
What Are VLMs?
How vision-language models combine image understanding with language generation.
How Does Inference Work?
What happens when your trained model generates a response.
Resource Usage
GPU memory specifications and compute credit costs.
