How Do I Choose the Right GPU?
Estimate VRAM requirements for your model, compare GPU tiers, and decide when to scale to multiple GPUs for VLM training in Datature Vi.
Choosing the right GPU depends on three things: your model size, your training method, and your budget. A 7B model with QLoRA fits on a single T4. The same model with full fine-tuning needs an A100. This page gives you the formulas, reference tables, and decision guides to pick the right hardware before you launch a training run.
How much VRAM does my model need?
VRAM consumption during training comes from four sources: model weights, optimizer states, gradients, and activation memory. The total depends on your model size, training method, and precision format.
VRAM estimation formula
- For LoRA with FP16: model parameters x 2 bytes (FP16 weights) + a small adapter overhead. Optimizer states and gradients apply only to the adapter parameters (0.1-1% of the total).
- For QLoRA (LoRA + NF4): model parameters x 0.5 bytes (NF4 weights) + adapter overhead in BF16. Roughly 4x less base-weight memory than FP16 LoRA.
- For Full SFT with FP16: parameters x 2 bytes (weights) + parameters x 2 bytes (gradients) + parameters x 8 bytes (AdamW optimizer states). Roughly 12 bytes per parameter in total, about 6x the weight-only footprint.
Actual usage varies by 10-15% depending on batch size, sequence length, and whether Flash Attention 2 is enabled. All estimates assume gradient checkpointing is enabled.
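The formulas above can be sketched as a small helper. This is an illustrative estimate only, not a Datature API: the function name is ours, and it counts only weights, gradients, and optimizer states (activation memory and the small LoRA adapter are extra).

```python
def weights_and_states_gb(params_billion: float, method: str) -> float:
    """Rough memory (GB) for weights, gradients, and optimizer states,
    per the formulas above. Activation memory is not included."""
    bytes_per_param = {
        "lora": 2.0,       # FP16 weights; adapter grads/states are negligible
        "qlora": 0.5,      # NF4 weights; BF16 adapter overhead is small
        "full_sft": 12.0,  # 2 (weights) + 2 (gradients) + 8 (AdamW states)
    }[method]
    return params_billion * 1e9 * bytes_per_param / 1e9

# A 7B model under each method:
print(weights_and_states_gb(7, "qlora"),     # 3.5 GB
      weights_and_states_gb(7, "lora"),      # 14.0 GB
      weights_and_states_gb(7, "full_sft"))  # 84.0 GB
```

Comparing these numbers against a GPU's VRAM (16 GB for a T4, 80 GB for an A100) reproduces the rule of thumb at the top of this page: 7B QLoRA fits a T4, while 7B full SFT needs A100-class hardware.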
Quick reference table
Datature Vi checks your configuration against the selected GPU's VRAM limit before launching a training run. If the estimated memory exceeds available VRAM, you will see a warning with a suggestion to switch GPUs or enable quantization.
GPU tiers available in Datature Vi
Each GPU tier has a compute credit multiplier that affects cost. See Resource Usage for multiplier values and pricing details.
When to scale to multiple GPUs
Single vs multi-GPU decision guide
How multi-GPU training works in Datature Vi
Datature Vi handles multi-GPU orchestration automatically. You select the GPU type and count; the platform configures the rest.
- Automatic parallelism strategy: Vi selects the right combination of data parallelism and model parallelism based on your model size and GPU count.
- NVLink interconnect: H100 GPUs use NVLink for high-bandwidth gradient synchronization. The platform configures topology automatically.
- Per-epoch checkpointing: A checkpoint is saved at the end of each epoch so you can resume if a run is interrupted.
- VRAM-aware scheduling: The platform estimates memory before launch and warns you if your configuration exceeds available VRAM.
- Async execution: Training runs execute in the background. You receive a notification when the run completes.
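Vi's actual selection heuristics aren't documented here, but the basic fit logic behind choosing between data and model parallelism can be sketched as follows (all names are illustrative, not platform APIs):

```python
def pick_parallelism(model_mem_gb: float, gpu_vram_gb: float, n_gpus: int) -> str:
    """Illustrative fit logic, not Vi's actual scheduler.

    Data parallelism replicates the full model on every GPU (it splits
    batches, not weights), so it requires the model to fit on one GPU.
    Model parallelism shards the weights across GPUs instead."""
    if model_mem_gb <= gpu_vram_gb:
        return "data_parallel"     # replicate model, split the batch
    if model_mem_gb <= gpu_vram_gb * n_gpus:
        return "model_parallel"    # shard weights across GPUs
    return "does_not_fit"          # larger GPUs or quantization needed
```

Real schedulers also account for sharding overhead, interconnect bandwidth, and hybrid strategies, which this sketch ignores.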
Memory optimization with NF4 quantization
NF4 (4-bit NormalFloat) quantization stores model weights in 4-bit precision, cutting memory by roughly 4x compared to FP16. Combined with LoRA (a pairing known as QLoRA), it lets you train large models on smaller GPUs.
For a 7B model, NF4 reduces weight memory from ~14 GB to ~3.5 GB. With LoRA adapters in BF16, total training memory drops to roughly 7 GB, fitting comfortably on a T4 (16 GB).
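The arithmetic behind the 7B example, using decimal gigabytes:

```python
PARAMS = 7e9  # 7B parameters

fp16_weights_gb = PARAMS * 2.0 / 1e9  # 14.0 GB at 2 bytes/param
nf4_weights_gb = PARAMS * 0.5 / 1e9   # 3.5 GB at 4 bits/param
assert fp16_weights_gb / nf4_weights_gb == 4.0  # 4x reduction

# The gap between 3.5 GB of weights and the quoted ~7 GB training
# footprint is BF16 adapters, activations, and framework overhead
# (a rough, configuration-dependent allowance, not an exact figure).
```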
For more on how quantization works and when it affects quality, see LoRA and Quantization.
Cost management tips
GPU time is the primary cost driver for VLM training. A few choices have outsized impact on your compute credit consumption.
Right-size your GPU
Use the VRAM estimation table above to pick the smallest GPU that fits your model and training method. Running QLoRA on a T4 costs 1x credits per minute; the same run on an A100 costs 6x.
Start with QLoRA
QLoRA is the default in Datature Vi for a reason. It uses the least memory, runs on the cheapest GPUs, and produces results within 1-3% of full SFT for most tasks. Switch to full SFT only after confirming that QLoRA results fall short.
Watch for early stopping signals
Monitor your validation loss curve during training. If validation loss stops improving for several epochs, the remaining epochs burn credits without improving the model. Kill the run early and save the best checkpoint.
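A common way to automate this check is a patience rule: stop once validation loss has not improved for a set number of epochs. A minimal sketch of that rule (not a Datature Vi API; the function name and defaults are ours):

```python
def should_stop(val_losses: list[float], patience: int = 3,
                min_delta: float = 0.0) -> bool:
    """Return True if the best loss of the last `patience` epochs is no
    better (by at least min_delta) than the best loss before them."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Loss plateaued after epoch 3, so three stagnant epochs trigger a stop:
print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # True
print(should_stop([1.0, 0.9, 0.8, 0.7]))               # False: still improving
```

Set `min_delta` above zero to ignore improvements too small to matter, which also catches runs that merely tread water on an exact plateau.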
Test with small batches first
Run a short experiment (5-10 epochs on 50-100 images) on a cheap GPU before committing to a large production run. This validates your configuration and catches issues before they consume significant credits.
For detailed cost calculations and credit multipliers per GPU tier, see Resource Usage.
Frequently asked questions
Related resources
Resource Usage
Monitor compute credit consumption and GPU multiplier rates.
Start a Training Run
Select GPU hardware and launch training.
LoRA and Quantization
How LoRA and NF4 reduce training cost and memory.
Model Settings
Configure training mode, hyperparameters, and precision.
Data Rows and Compute Credits
How the two resource currencies work in Datature Vi.
Model Architectures
Compare the available VLM architectures and their sizes.
